Abstract

A scattering transform defines a locally translation invariant representation
which is stable to time-warping deformations. It extends MFCC representations
by computing modulation spectrum coefficients of multiple orders, through
cascades of wavelet convolutions and modulus operators. Second-order scattering
coefficients characterize transient phenomena such as attacks and amplitude
modulation. A frequency transposition invariant representation is obtained by
applying a scattering transform along log-frequency. State-of-the-art
classification results are obtained for musical genre and phone classification
on GTZAN and TIMIT databases, respectively.

Keywords: Audio classification, deep neural networks, MFCC, modulation
spectrum, wavelets.
1 Introduction

A major difficulty of audio representations for classification is the
multiplicity of information at different time scales: pitch and timbre at the
scale of milliseconds, the rhythm of speech and music at the scale of seconds,
and the music progression over minutes and hours. Mel-frequency cepstral
coefficients (MFCCs) are efficient local descriptors at time scales up to 25ms.
Capturing larger structures up to 500ms is however necessary in most
applications. This paper studies the construction of stable, invariant signal
representations over such larger time scales. We concentrate on audio
applications, but introduce a generic scattering representation for
classification, which applies to many signal modalities beyond audio [12].

Spectrograms compute locally invariant descriptors over time intervals limited
by a window. Section 2 shows that high-frequency spectrogram coefficients
are not stable to variability due to time-warping deformations, which occur
in most signals, particularly in audio. MFCCs average spectrogram values over
mel-frequency bands, which improves stability to time warping but also removes
information. Over time intervals larger than 25ms, the information loss becomes
too large, which is why MFCCs are limited to such short time intervals.
Figure 1: (a) Spectrogram $\log|\hat x(t,\omega)|$ for a harmonic signal $x(t)$
(centered in $t_0$) followed by $\log|\hat x_\tau(t,\omega)|$ for
$x_\tau(t) = x((1-\epsilon)t)$ (centered in $t_1$), as a function of $t$ and
$\omega$. The right graph plots $\log|\hat x(t_0,\omega)|$ (blue) and
$\log|\hat x_\tau(t_1,\omega)|$ (red) as a function of $\omega$. Their partials
do not overlap at high frequencies. (b) Mel-frequency spectrogram
$\log Mx(t,\omega)$ followed by $\log Mx_\tau(t,\omega)$. The right graph plots
$\log Mx(t_0,\omega)$ (blue) and $\log Mx_\tau(t_1,\omega)$ (red) as a function
of $\omega$. With a mel-scale frequency averaging, the partials of $x$ and
$x_\tau$ overlap at all frequencies.



Modulation spectrum decompositions [2, 17, 23, 26, 32, 36, 37, 40, 42]
characterize the temporal evolution of mel-frequency spectrograms over larger
time scales, with autocorrelation or Fourier coefficients. However, this
modulation spectrum also suffers from instability to time-warping deformation,
which impedes classification performance.

Section 3 shows that the information lost by mel-frequency spectral
coefficients can be recovered with multiple layers of wavelet coefficients,
which are stable to time-warping deformations. A scattering transform [31]
computes such a cascade of wavelet transforms and modulus non-linearities. Its
computational structure is similar to a convolutional deep neural network
[3, 14, 21, 24, 25, 27, 34] but involves no learning. It outputs time-averaged
coefficients, providing informative signal invariants over potentially large
time scales.

A scattering transform has striking similarities with physiological models of
the cochlea and of the auditory pathway [11, 15], also used for audio
processing [33]. Its energy conservation and other mathematical properties are
reviewed in Section 4. An approximate inverse scattering transform is
introduced in Section 5 with numerical examples. Section 6 relates the
amplitude of scattering coefficients to audio signal properties. These
coefficients provide accurate measurements of frequency intervals between
harmonics and also characterize the amplitude modulation of voiced and unvoiced
sounds. The logarithm of scattering coefficients linearly separates audio
components related to pitch, formant and timbre.

Frequency transpositions form another important source of audio variability,
which should be kept or removed depending upon the classification task.
For example, speaker-independent phone recognition requires some frequency
transposition invariance, while frequency localization is necessary for speaker
identification. Section 7 shows that cascading a scattering transform along
log-frequency yields a transposition invariant representation which is stable to
frequency deformation.



Scattering representations have proved useful for image [5, 39] and audio
[1, 10] classification. Section 8 explains how to adapt and optimize the amount
of time and frequency invariance for each signal class, at the supervised
learning stage. A time and frequency scattering representation is used for musical
genre classification over the GTZAN database, and for phone classification over 
the TIMIT corpus. State-of-the-art results are obtained with a Gaussian kernel 
SVM applied to scattering feature vectors. All figures and results are
reproducible using a MATLAB software package, available at
http://www.di.ens.fr/signal/scattering/



2 Mel-frequency Spectrum 



Section 2.1 shows that high-frequency spectrogram coefficients are not stable to
time-warping deformation. The mel-frequency spectrogram stabilizes these
coefficients by averaging them along frequency, but loses information. To analyze
this information loss, Section 2.2 relates the mel-frequency spectrogram to the
amplitude output of a filter bank which computes a wavelet transform.

2.1 Fourier Invariance and Deformation Instability 

Let $\hat x(\omega) = \int x(u)\, e^{-i\omega u}\, du$ be the Fourier transform
of $x$. If $x_c(t) = x(t - c)$ then $\hat x_c(\omega) = e^{-ic\omega}\,\hat x(\omega)$.
The Fourier transform modulus is thus invariant to translation:

$$|\hat x_c(\omega)| = |\hat x(\omega)| . \qquad (1)$$

A spectrogram localizes this translation invariance with a window $\phi$ of
duration $T$ such that $\int \phi(u)\, du = 1$. It is defined by

$$|\hat x(t,\omega)| = \left| \int x(u)\, \phi(u - t)\, e^{-i\omega u}\, du \right| . \qquad (2)$$

If $|c| \ll T$ then one can verify that $|\hat x_c(t,\omega)| \approx |\hat x(t,\omega)|$.

Suppose that $x$ is not just translated but time-warped to give
$x_\tau(t) = x(t - \tau(t))$ with $|\tau'(t)| < 1$. A representation $\Phi(x)$
is said to be stable to deformation if its Euclidean norm
$\|\Phi(x) - \Phi(x_\tau)\|$ is small when the deformation is small. The
deformation size is measured by $\sup_t |\tau'(t)|$. If it vanishes then it is a
"pure" translation without deformation. Stability is formally defined as a
Lipschitz continuity condition relative to this metric. It means that there
exists $C > 0$ such that for all $\tau$ with $\sup_t |\tau'(t)| < 1$

$$\|\Phi(x) - \Phi(x_\tau)\| \le C\, \sup_t |\tau'(t)|\; \|x\| . \qquad (3)$$

A Fourier modulus representation $\Phi(x) = |\hat x|$ is not stable to
deformation because high frequencies are severely distorted by small
deformations. For example, let us consider a small dilation
$\tau(t) = \epsilon t$ with $0 < \epsilon \ll 1$. Since $\tau'(t) = \epsilon$,
the Lipschitz continuity condition (3) becomes

$$\big\|\, |\hat x| - |\hat x_\tau| \,\big\| \le C\, \epsilon\, \|x\| . \qquad (4)$$

The Fourier transform of $x_\tau(t) = x((1-\epsilon)t)$ is
$\hat x_\tau(\omega) = (1-\epsilon)^{-1}\, \hat x((1-\epsilon)^{-1}\omega)$.
This dilation shifts a frequency $\omega_0$ by $\epsilon|\omega_0|$. For a
harmonic signal $x(t) = g(t) \sum_n a_n \cos(n\xi t)$, the Fourier transform is
a sum of partials

$$\hat x(\omega) = \sum_n \frac{a_n}{2} \Big( \hat g(\omega - n\xi) + \hat g(\omega + n\xi) \Big) . \qquad (5)$$

After time-warping, each partial $\hat g(\omega \pm n\xi)$ is translated by
$\epsilon n |\xi|$, as shown in the spectrogram of Figure 1(a). Even though
$\epsilon$ is small, at high frequencies $n\epsilon|\xi|$ becomes larger than
the bandwidth of $\hat g$. Consequently, the supports of deformed harmonics
$\hat g(\omega(1-\epsilon)^{-1} - n\xi)$ do not overlap those of the original
harmonics $\hat g(\omega - n\xi)$ and hence induce a large Euclidean distance.
The Euclidean distance of $|\hat x|$ and $|\hat x_\tau|$ thus does not decrease
proportionally to $\epsilon$ if the harmonic amplitudes $a_n$ are sufficiently
large at high frequencies. This proves that the continuity condition (4) is not
satisfied.

The autocorrelation $Rx(u) = \int x(t)\, x^*(t - u)\, dt$ is also a translation
invariant representation which has the same deformation instability as the
Fourier transform modulus. Indeed, $\widehat{Rx}(\omega) = |\hat x(\omega)|^2$ so
$\|Rx - Rx_\tau\| = (2\pi)^{-1}\, \big\|\, |\hat x|^2 - |\hat x_\tau|^2 \,\big\|$.
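The instability argument above can be checked numerically. The following is a minimal sketch, not the paper's code: it assumes a discrete signal with a single Gaussian-windowed partial, and the function name `fourier_modulus_distance`, the window width and the two test frequencies are illustrative choices. It compares the relative distance between the Fourier moduli of $x$ and of its dilation $x((1-\epsilon)t)$ for a low-frequency and a high-frequency partial.

```python
import numpy as np

def fourier_modulus_distance(xi, eps, n=2**14, width=400.0):
    """Relative distance between the Fourier moduli of x and x((1 - eps) t)."""
    t = np.arange(n) - n / 2

    def signal(u):
        # One partial g(u) cos(xi u), with a Gaussian window g (assumed shape).
        return np.exp(-u**2 / (2 * width**2)) * np.cos(xi * u)

    x = signal(t)
    x_tau = signal((1 - eps) * t)          # small dilation of x
    mod = np.abs(np.fft.rfft(x))
    mod_tau = np.abs(np.fft.rfft(x_tau))
    return np.linalg.norm(mod - mod_tau) / np.linalg.norm(mod)

eps = 0.01
low = fourier_modulus_distance(xi=0.05, eps=eps)   # shift eps*xi below the bandwidth of g
high = fourier_modulus_distance(xi=2.0, eps=eps)   # shift eps*xi above the bandwidth of g
print(low, high)
```

With these assumed parameters the window has a frequency bandwidth of about 1/400 rad per sample, so the shift $\epsilon\xi$ stays inside the bandwidth for the low partial and exceeds it for the high one. The distance for the high partial is then of the order of the signal norm even though $\epsilon$ is unchanged.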

2.2 Mel-frequency Deformation Stability and Filter Banks 

A mel-frequency spectrogram averages the spectrogram energy with mel-scale
filters $\hat\psi_\lambda$, where $\lambda$ is the center frequency of each
$\hat\psi_\lambda(\omega)$:

$$Mx(t,\lambda) = \frac{1}{2\pi} \int |\hat x(t,\omega)|^2\, |\hat\psi_\lambda(\omega)|^2\, d\omega . \qquad (6)$$

The band-pass filters $\hat\psi_\lambda$ have a constant-Q frequency bandwidth,
with a support centered at $\lambda$ whose size is proportional to $\lambda$. At
the lowest frequencies, instead of being constant-Q, the bandwidth of
$\hat\psi_\lambda$ remains equal to $2\pi/T$ so that $\psi_\lambda(t)$ is mostly
localized in a time interval of size $T$.

The mel-frequency averaging removes deformation instability. Indeed, larger
displacements of high frequencies are compensated by the wider averaging by
the kernel $|\hat\psi_\lambda(\omega)|^2$. After this averaging, Figure 1(b)
shows that the partials of a harmonic signal $x$ and the partials of its
dilation $x_\tau$ still overlap at high frequencies. As opposed to spectrograms,
a mel-frequency representation $\Phi(x) = Mx$ is Lipschitz stable to
deformations in the sense of (3).

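This time-warping stability is due to the mel-scale averaging. However, this
averaging loses information. We show that this frequency averaging can be
rewritten as a time averaging of a filter bank output. Since $\hat x(t,\omega)$
in (2) is the Fourier transform of $x_t(u) = x(u)\,\phi(u - t)$, applying
Plancherel's formula gives

$$Mx(t,\lambda) = \frac{1}{2\pi} \int |\hat x_t(\omega)|^2\, |\hat\psi_\lambda(\omega)|^2\, d\omega \qquad (7)$$

$$\phantom{Mx(t,\lambda)} = \int |x_t \star \psi_\lambda(v)|^2\, dv \qquad (8)$$

$$\phantom{Mx(t,\lambda)} = \int \left| \int x(u)\, \phi(u - t)\, \psi_\lambda(v - u)\, du \right|^2 dv . \qquad (9)$$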
Figure 2: (a): Scalogram $\log|x \star \psi_\lambda(t)|^2$ for a musical signal,
as a function of $t$ and $\lambda$. (b): Averaged scalogram
$\log|x \star \psi_\lambda|^2 \star |\phi|^2(t)$ with a lowpass filter $\phi$ of
duration $T = 190$ms.



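If $\lambda \gg Q/T$ then $\phi(t)$ is approximately constant on the support of
$\psi_\lambda(t)$, so
$\phi(u - t)\,\psi_\lambda(v - u) \approx \phi(v - t)\,\psi_\lambda(v - u)$, and
hence

$$Mx(t,\lambda) \approx \int \left| \int x(u)\, \psi_\lambda(v - u)\, du \right|^2 |\phi(v - t)|^2\, dv \qquad (10)$$

$$\phantom{Mx(t,\lambda)} = |x \star \psi_\lambda|^2 \star |\phi|^2(t) . \qquad (11)$$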



The frequency averaging of the spectrogram is thus nearly equal to the time
averaging of $|x \star \psi_\lambda|^2$. Section 3.1 studies the properties of
the constant-Q filter bank $\{\psi_\lambda\}_\lambda$, which defines an analytic
wavelet transform.

Figures 2(a) and 2(b) display $|x \star \psi_\lambda|^2$ and
$|x \star \psi_\lambda|^2 \star |\phi|^2$, respectively, for a musical
recording. The window duration is $T = 190$ms. This time averaging
removes fine-scale information such as vibratos and attacks. To reduce
information loss, a mel-frequency spectrogram is often computed over small time
windows of about 25ms. As a result, it does not capture large-scale structures,
which limits classification performance.

To increase $T$ without losing too much information, it is necessary to capture
the high frequencies of $|x \star \psi_\lambda|$. This can be done with a
modulation spectrum. The modulation spectrum can be defined as the spectrogram
[2, 23, 32, 37] of $|x \star \psi_\lambda|$, or as its short-time
autocorrelation [36, 40]. However, these modulation spectra are unstable to
time-warping deformations. Indeed, a time-warping of $x$ induces a time-warping
of $|x \star \psi_\lambda|$, and Section 2.1 showed that spectrograms and
autocorrelations have deformation instabilities. Constant-Q averaged modulation
spectra [17, 42] stabilize spectrogram representations with another averaging
along modulation frequencies. According to (11), this can also be computed with
a second constant-Q filter bank. The scattering transform follows this latter
approach.



3 Wavelet Scattering Transform 

A scattering transform recovers the information lost by a mel-frequency
averaging with a cascade of wavelet decompositions and modulus operators [31].
It is locally translation invariant and stable to time-warping deformation.
Important properties of constant-Q filter banks are first reviewed in the
framework of a wavelet transform, and the scattering transform is introduced in
Section 3.2.

3.1 Analytic Wavelet Transform and Modulus 

Constant-Q filter banks compute a wavelet transform. We review the properties 
of complex analytic wavelet transforms and their modulus, which are used to 
calculate mel-frequency spectral coefficients. 

A wavelet $\psi(t)$ is a band-pass filter with $\hat\psi(0) = 0$. We consider
analytic wavelets such that $\hat\psi(\omega) = 0$ for $\omega < 0$. As a
result, $\psi(t)$ is a complex quadrature phase wavelet. For any $\lambda > 0$,
a dilated wavelet of center frequency $\lambda$ is written

$$\psi_\lambda(t) = \lambda\, \psi(\lambda t) \quad\text{and hence}\quad \hat\psi_\lambda(\omega) = \hat\psi\!\left(\frac{\omega}{\lambda}\right) . \qquad (12)$$

The center frequency of $\hat\psi$ is normalized to 1. In the following, we
denote by $Q$ the number of wavelets per octave, which means that
$\lambda = 2^{j/Q}$ for $j \in \mathbb{Z}$. The bandwidth of $\hat\psi$ is of
the order of $Q^{-1}$, to cover the whole frequency axis with these band-pass
wavelet filters. The support of $\hat\psi_\lambda(\omega)$ is centered in
$\lambda$ with a frequency bandwidth $\lambda/Q$ whereas the energy of
$\psi_\lambda(t)$ is concentrated around 0 in an interval of size
$2\pi Q/\lambda$. To guarantee that this interval is smaller than $T$, we define
$\psi_\lambda$ with (12) only for $\lambda \ge 2\pi Q/T$. For
$\lambda < 2\pi Q/T$, the lower frequency interval $[0, 2\pi Q/T]$ is covered
with about $Q - 1$ equally-spaced filters $\hat\psi_\lambda$ with constant
frequency bandwidth $2\pi/T$. For simplicity, these lower-frequency filters are
still called wavelets. We denote by $\Lambda$ the grid of all wavelet center
frequencies $\lambda$.
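The construction of the frequency grid can be sketched in a few lines. This is an illustration under assumptions rather than the paper's implementation: frequencies are in radians per sample, the highest admissible center frequency is taken to be $\pi$, and the function name `center_frequencies` is hypothetical.

```python
import numpy as np

def center_frequencies(Q, T, lam_max=np.pi):
    """Return (equally spaced low centers, constant-Q centers 2**(j/Q))."""
    lam_min = 2 * np.pi * Q / T                    # constant-Q only above this value
    j_lo = int(np.ceil(Q * np.log2(lam_min)))
    j_hi = int(np.floor(Q * np.log2(lam_max)))
    constant_q = 2.0 ** (np.arange(j_lo, j_hi + 1) / Q)
    # About Q - 1 filters spaced by 2*pi/T cover the interval [0, 2*pi*Q/T].
    linear = (2 * np.pi / T) * np.arange(1, Q)
    return linear, constant_q

linear, constant_q = center_frequencies(Q=8, T=4096)
print(len(linear), len(constant_q))
```

Each constant-Q center is a factor $2^{1/Q}$ above the previous one, so the temporal support $2\pi Q/\lambda$ of every such wavelet stays below $T$.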

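The wavelet transform of $x$ computes a convolution of $x$ with a low-pass
filter $\phi$ of frequency bandwidth $2\pi/T$, and convolutions with all
higher-frequency wavelets $\psi_\lambda$ for $\lambda \in \Lambda$:

$$Wx = \Big( x \star \phi(t)\,,\; x \star \psi_\lambda(t) \Big)_{t \in \mathbb{R},\, \lambda \in \Lambda} . \qquad (13)$$

This time index $t$ is not critically sampled as in wavelet bases so this
representation is highly redundant. The wavelet $\psi$ and the low-pass filter
$\phi$ are designed to build filters which cover the whole frequency axis, which
means that

$$A(\omega) = |\hat\phi(\omega)|^2 + \frac{1}{2} \sum_{\lambda \in \Lambda} \Big( |\hat\psi_\lambda(\omega)|^2 + |\hat\psi_\lambda(-\omega)|^2 \Big) \qquad (14)$$

satisfies, for all $\omega \in \mathbb{R}$:

$$1 - \alpha \le A(\omega) \le 1 \quad\text{with}\quad \alpha < 1 . \qquad (15)$$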



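This condition implies that the wavelet transform $W$ is a stable and invertible
operator. Multiplying (15) by $|\hat x(\omega)|^2$ and applying the Plancherel
formula [30] gives

$$(1 - \alpha)\, \|x\|^2 \le \|Wx\|^2 \le \|x\|^2 , \qquad (16)$$

where $\|x\|^2 = \int |x(t)|^2\, dt$ and the squared norm of $Wx$ sums all
squared coefficients:

$$\|Wx\|^2 = \int |x \star \phi(t)|^2\, dt + \sum_{\lambda \in \Lambda} \int |x \star \psi_\lambda(t)|^2\, dt .$$

The upper bound (16) means that $W$ is a contractive operator and the lower
bound implies that it has a stable inverse. One can also verify that the
pseudo-inverse of $W$ recovers $x$ with the following formula

$$x(t) = (x \star \phi) \star \bar\phi(t) + \sum_{\lambda \in \Lambda} \mathrm{Real}\Big( (x \star \psi_\lambda) \star \bar\psi_\lambda(t) \Big) , \qquad (17)$$

with reconstruction filters defined by

$$\hat{\bar\phi}(\omega) = \frac{\hat\phi^*(\omega)}{A(\omega)} \quad\text{and}\quad \hat{\bar\psi}_\lambda(\omega) = \frac{\hat\psi_\lambda^*(\omega)}{A(\omega)} , \qquad (18)$$

where $z^*$ is the complex conjugate of $z \in \mathbb{C}$. If $\alpha = 0$ in
(15) then $W$ is unitary, $\bar\phi(t) = \phi(-t)$ and
$\bar\psi_\lambda(t) = \psi_\lambda^*(-t)$.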



Following (11), mel-frequency spectrograms can be approximated using a
non-linear wavelet modulus operator which removes the complex phase of all
wavelet coefficients:

$$|W|x = \Big( x \star \phi(t)\,,\; |x \star \psi_\lambda(t)| \Big)_{t \in \mathbb{R},\, \lambda \in \Lambda} . \qquad (19)$$

A signal cannot be reconstructed from the modulus of its Fourier transform
but the situation is different for a wavelet transform which is highly redundant.
Despite the loss of phase, for particular families of analytic wavelets, one can
prove that $|W|$ is an invertible operator with a continuous inverse. The
modulus thus does not lose information.

This operator $|W|$ is also contractive. Indeed, the wavelet transform $W$ is
contractive and the complex modulus is contractive in the sense that
$\big|\, |a| - |b| \,\big| \le |a - b|$ for any $(a, b) \in \mathbb{C}^2$ so

$$\big\|\, |W|x - |W|x' \,\big\|^2 \le \|Wx - Wx'\|^2 \le \|x - x'\|^2 .$$

If $W$ is a unitary operator then $\|\,|W|x\,\| = \|Wx\| = \|x\|$ so $|W|$
preserves the signal norm.

One may define an analytic wavelet with an octave resolution $Q$ as
$\psi(t) = e^{it}\,\theta(t)$ and hence $\hat\psi(\omega) = \hat\theta(\omega - 1)$
where $\hat\theta$ is the transfer function of a low-pass filter whose bandwidth
is of the order of $Q^{-1}$. If $\theta$ is a Gaussian then $\psi$ is a complex
Gabor function which is almost analytic because $|\hat\psi(\omega)|$ is small
but not strictly zero for $\omega < 0$. Figure 3 shows Gabor wavelets
$\hat\psi_\lambda$ with $Q = 8$. In this case $\phi$ is also a Gaussian. Morlet
wavelets are modified Gabor wavelets
$\hat\psi(\omega) = \hat\theta(\omega - 1) - \hat\theta(\omega)\,\hat\theta(1)/\hat\theta(0)$
which guarantees that $\hat\psi(0) = 0$, and $\phi$ remains a Gaussian. For
$Q = 1$, unitary wavelet transforms can also be obtained by choosing $\psi$ to
be the analytic part of a real wavelet which generates an orthogonal wavelet
basis, such as a cubic spline wavelet [31].

Figure 3: Gabor wavelets $\hat\psi_\lambda(\omega)$ with $Q = 8$ wavelets per
octave, for different $\lambda$. The low frequency filter $\hat\phi(\omega)$ (in
red) is a Gaussian.



3.2 Deep Scattering Network 

We showed in (11) that mel-frequency spectral coefficients $Mx(t,\lambda)$ are
approximately equal to averaged squared wavelet coefficients
$|x \star \psi_\lambda|^2 \star |\phi|^2(t)$. The time averaging by the low-pass
filter $\phi$ provides descriptors that are locally invariant to small
translations relative to the duration $T$ of $\phi$. To avoid amplifying
large amplitude coefficients, a scattering transform computes
$|x \star \psi_\lambda| \star \phi(t)$ instead. It then recovers the information
lost to time averaging by calculating additional invariant coefficients with a
cascade of wavelet modulus transforms.

The simplest locally invariant descriptor of $x$ is given by its time-average
$S_0x(t) = x \star \phi(t)$, which removes all high frequencies. Complementary
high-frequency information is provided by a first wavelet modulus transform

$$|W_1|x = \Big( x \star \phi(t)\,,\; |x \star \psi_{\lambda_1}(t)| \Big)_{t \in \mathbb{R},\, \lambda_1 \in \Lambda_1} ,$$

computed with wavelets $\psi_{\lambda_1}$ having an octave frequency resolution
$Q_1$. For audio signals we set $Q_1 = 8$, which defines wavelets having the
same frequency resolution as mel-frequency filters. Audio signals have little
energy at low frequencies so $S_0x(t) \approx 0$. Mel-frequency spectral
coefficients are obtained by averaging the wavelet modulus coefficients with
$\phi$:

$$S_1x(t,\lambda_1) = |x \star \psi_{\lambda_1}| \star \phi(t) . \qquad (20)$$

These are called first-order scattering coefficients. They are computed with a
second wavelet modulus transform $|W_2|$ applied to each
$|x \star \psi_{\lambda_1}|$, which also provides complementary high frequency
wavelet coefficients:

$$|W_2|\,|x \star \psi_{\lambda_1}| = \Big( |x \star \psi_{\lambda_1}| \star \phi\,,\; \big|\, |x \star \psi_{\lambda_1}| \star \psi_{\lambda_2} \big| \Big)_{\lambda_2 \in \Lambda_2} .$$



The wavelets $\psi_{\lambda_2}$ have an octave resolution $Q_2$ which may be
different from $Q_1$. It is chosen to get a sparse representation which means
concentrating the signal information over as few wavelet coefficients as
possible. These coefficients are averaged by $\phi$ to obtain translation
invariant coefficients, which defines second-order scattering coefficients:

$$S_2x(t,\lambda_1,\lambda_2) = \big|\, |x \star \psi_{\lambda_1}| \star \psi_{\lambda_2} \big| \star \phi(t) .$$

These averages are computed by applying a third wavelet modulus transform
$|W_3|$ to each $\big|\, |x \star \psi_{\lambda_1}| \star \psi_{\lambda_2} \big|$.
It computes their wavelet coefficients through convolutions with a new set of
wavelets $\psi_{\lambda_3}$ having an octave resolution $Q_3$. Iterating this
process defines scattering coefficients at any order $m$.

Figure 4: A scattering transform iterates on wavelet modulus operators $|W_m|$
to compute cascades of $m$ wavelet convolutions and modulus stored in $U_m x$,
and to output averaged scattering coefficients $S_m x$.
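For any $m \ge 1$, iterated wavelet modulus convolutions are written:

$$U_m x(t,\lambda_1,\ldots,\lambda_m) = \big|\, \big| |x \star \psi_{\lambda_1}| \star \cdots \big| \star \psi_{\lambda_m}(t) \big| ,$$

where $m$th-order wavelets $\psi_{\lambda_m}$ have an octave resolution $Q_m$,
and satisfy the stability condition (15). Averaging $U_m x$ with $\phi$ gives
scattering coefficients of order $m$:

$$S_m x(t,\lambda_1,\ldots,\lambda_m) = \big|\, \big| |x \star \psi_{\lambda_1}| \star \cdots \big| \star \psi_{\lambda_m} \big| \star \phi(t) = U_m x(\cdot,\lambda_1,\ldots,\lambda_m) \star \phi(t) .$$

Applying $|W_{m+1}|$ on $U_m x$ computes both $S_m x$ and $U_{m+1}x$:

$$|W_{m+1}|\, U_m x = (S_m x\,,\; U_{m+1} x) . \qquad (21)$$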



A scattering decomposition of maximal order $\bar m$ is thus defined by
initializing $U_0 x = x$, and recursively computing (21) for
$0 \le m \le \bar m$. This scattering transform is illustrated in Figure 4. The
final scattering vector aggregates all scattering coefficients for
$0 \le m \le \bar m$:

$$Sx = (S_m x)_{0 \le m \le \bar m} . \qquad (22)$$
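The recursion (21) up to order two can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's software: filters are Morlet-like transfer functions sampled on the FFT grid with an assumed relative bandwidth $1/(2Q)$, the low-pass filter is a Gaussian of bandwidth about $2\pi/T$, convolutions are circular, no subsampling is applied, and all pairs $(\lambda_1, \lambda_2)$ are computed. The names `morlet_bank` and `scattering` are hypothetical.

```python
import numpy as np

def morlet_bank(n, Q, n_octaves):
    """Transfer functions psi_hat_lambda, one row per lambda = pi * 2**(-j/Q)."""
    omega = 2 * np.pi * np.fft.fftfreq(n)
    lams = np.pi * 2.0 ** (-np.arange(1, Q * n_octaves + 1) / Q)
    s = 1.0 / (2 * Q)                                  # bandwidth of order 1/Q (assumed)
    u = omega[None, :] / lams[:, None]
    theta = lambda v: np.exp(-v**2 / (2 * s**2))
    return theta(u - 1) - theta(u) * theta(1.0)        # Morlet correction: zero at omega = 0

def scattering(x, T=512, Q1=8, Q2=1, n_octaves=6):
    """Return (S0, S1, S2) at every time sample."""
    n = x.size
    omega = 2 * np.pi * np.fft.fftfreq(n)
    phi_hat = np.exp(-0.5 * (omega * T / (2 * np.pi)) ** 2)   # Gaussian low-pass
    psi1 = morlet_bank(n, Q1, n_octaves)
    psi2 = morlet_bank(n, Q2, n_octaves)

    def average(u):
        # u * phi along the last axis
        return np.fft.ifft(np.fft.fft(u, axis=-1) * phi_hat, axis=-1).real

    def wavelet_modulus(u, psi_hat):
        # |u * psi_lambda| for every lambda; adds one axis before time
        u_hat = np.fft.fft(u, axis=-1)
        return np.abs(np.fft.ifft(u_hat[..., None, :] * psi_hat, axis=-1))

    U1 = wavelet_modulus(x, psi1)        # shape (n1, n)
    U2 = wavelet_modulus(U1, psi2)       # shape (n1, n2, n)
    return average(x), average(U1), average(U2)

t = np.arange(2048)
x = (1 + 0.5 * np.cos(2 * np.pi * t / 64)) * np.cos(np.pi / 4 * t)   # modulated tone
S0, S1, S2 = scattering(x)
```

In this example the carrier frequency $\pi/4$ coincides with one first-order center frequency, so the time-averaged `S1` peaks at that $\lambda_1$. The filters here are not normalized to satisfy (15), so the energy identities of Section 4.2 are not expected to hold exactly for this sketch.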



The scattering cascade of convolutions and non-linearities can also be
interpreted as a convolutional network [25], where $U_m x$ is the set of
coefficients of the $m$th internal network layer. These networks have been shown
to be highly effective for audio classification [3, 14, 21, 24, 27, 34].
However, unlike standard convolutional networks, each such layer has an output
$S_m x = U_m x \star \phi$, not just the last layer. In addition, all filters
are predefined wavelets and are not learned from training data.

The wavelet octave resolutions are optimized at each layer $m$ to produce
sparse wavelet coefficients at the next layer. This better preserves the signal
information as explained in Section 5. For audio signals $x$, we choose $Q_1 = 8$
wavelets per octave, which corresponds to a mel-frequency decomposition. This
configuration has been shown to provide sparse representations of a mix of 
speech, music and environmental signals Hy. Such signals x often include har- 
monics which have sparse representations if the mother wavelet has sufficiently 
narrow frequency support. In addition, sparse representations have been shown
to be better suited for classification [22, 35].



At the second order, choosing $Q_2 = 1$ defines wavelets with narrower
time support, which are better adapted to characterize transients and attacks
in the second-order modulation spectrum. Section 6 shows that musical signals
including modulation structures such as tremolo may however require wavelets
having better frequency resolution, and hence $Q_2 > 1$. At higher orders
$m \ge 3$ we set $Q_m = 1$ in all cases, but we will see that these coefficients
can often be neglected.

The scattering cascade has similarities with several neurophysiological
models of auditory processing, which incorporate cascades of constant-Q filter
banks followed by non-linearities [11, 15]. The first filter bank with $Q_1 = 8$
models the cochlear filtering, whereas the second filter bank corresponds to
later processing in the auditory pathway, which is modeled using filters with
$Q_2 = 1$ [11, 15].



4 Scattering Properties 

We briefly review important properties of scattering transforms, including sta- 
bility to time-warping deformation, energy conservation and an algorithm for 
fast computation. 

4.1 Time-Warping Stability

The Fourier transform is unstable to deformation because dilating a sinusoidal
wave yields a new sinusoidal wave of different frequency which is orthogonal to
the original one. Section 2 explains that mel-frequency spectrograms become
stable to time-warping deformation with a frequency averaging. One can prove
[31] that a scattering representation $\Phi(x) = Sx$ satisfies the Lipschitz
continuity condition (3) relative to deformations because wavelet transforms are
stable to deformation. Indeed, wavelets are regular and well-localized in time,
and a small deformation of a wavelet yields a function which is highly similar
to the original one.



The squared Euclidean norm of a scattering vector $Sx$ is the sum of its
coefficients squared at all orders:

$$\|Sx\|^2 = \sum_{m=0}^{\bar m} \|S_m x\|^2 = \sum_{m=0}^{\bar m}\; \sum_{\lambda_1,\ldots,\lambda_m} \int |S_m x(t,\lambda_1,\ldots,\lambda_m)|^2\, dt .$$

We consider deformations $x_\tau(t) = x(t - \tau(t))$ with $|\tau'(t)| < 1$ and
$\sup_t |\tau(t)| \ll T$, which means that the maximum displacement is small
relative to the time support of $\phi$. One can prove [31] that there exists a
constant $C$ such that for all $x$ and any such $\tau$:

$$\|Sx_\tau - Sx\| \le C\, \sup_t |\tau'(t)|\; \|x\| , \qquad (23)$$

up to second-order terms. This Lipschitz continuity property implies that time- 
warping deformations can be locally linearized in the scattering space. Indeed, 
Lipschitz continuous operators are almost everywhere differentiable. Invariants 
to small deformations can thus be computed with linear operators in the scat- 
tering domain. This property is particularly important for linear discriminant 
classifiers such as support vector machines (SVMs). 

4.2 Contraction and Energy Conservation 

We show that a scattering transform is contractive and can preserve energy. Let
us define the squared Euclidean norm $\|Ax\|^2$ of a vector of coefficients $Ax$
(such as $|W_m|x$, $S_m x$, $U_m x$ or $Sx$) as the sum of all its coefficients
squared.

Since $Sx$ is computed by cascading wavelet modulus operators $|W_m|$, which
are all contractive, it results that $S$ is also contractive:

$$\|Sx - Sx'\| \le \|x - x'\| . \qquad (24)$$

A scattering transform is therefore stable to additive noise.

If each wavelet transform is unitary, each $|W_m|$ preserves the signal norm.
Applying this property to $|W_{m+1}|\, U_m x = (S_m x\,,\; U_{m+1} x)$ yields

$$\|U_m x\|^2 = \|S_m x\|^2 + \|U_{m+1} x\|^2 . \qquad (25)$$

Summing these equations for $0 \le m \le \bar m$ proves that

$$\|x\|^2 = \|Sx\|^2 + \|U_{\bar m + 1} x\|^2 . \qquad (26)$$



Under appropriate assumptions on the mother wavelet $\psi$, one can prove that
$\|U_{m+1} x\|$ goes to zero as $m$ increases [31], which implies that
$\|Sx\| = \|x\|$ for $\bar m = \infty$. This property comes from the fact that
the modulus of analytic wavelet coefficients computes a smooth envelope, and
hence pushes energy towards lower frequencies. By iterating on wavelet modulus
operators, the scattering transform progressively propagates all the energy of
$U_m x$ towards lower frequencies, which is captured by the low-pass filter of
scattering coefficients $S_m x = U_m x \star \phi$.

    T        m = 0     m = 1     m = 2     m = 3
    23ms     0.1%      97.4%     2.4%      0.2%
    93ms     0.0%      89.6%     8.3%      0.7%
    370ms    0.0%      71.7%     23.7%     2.8%
    1.5 s    0.0%      57.3%     32.9%     7.2%

Table 1: Averaged values $\|S_m x\|^2 / \|x\|^2$ computed for signals $x$ in the
GTZAN music dataset [43], as a function of order $m$ and averaging scale $T$.
For $m = 1$, $S_m x$ is calculated by Gabor wavelets with $Q_1 = 8$, and for
$m = 2, 3$ by cubic spline wavelets with $Q_2 = Q_3 = 1$.
One can verify numerically that $\|U_{m+1} x\|$ converges to zero exponentially
when $m$ goes to infinity and hence that $\|Sx\|$ converges exponentially to
$\|x\|$. Table 1 gives the fraction of energy $\|S_m x\|^2 / \|x\|^2$ absorbed
by each scattering order. Since audio signals have little energy at low
frequencies, $S_0 x$ is very small and most of the energy is absorbed by
$S_1 x$ for $T = 23$ms. This explains why mel-frequency spectrograms are
typically sufficient at these small time scales. However, as $T$ increases, a
progressively larger proportion of energy is absorbed by higher-order scattering
coefficients. For $T = 370$ms, about 24% of the signal energy is captured in
$S_2 x$. Section 6 shows that at this time scale, important amplitude modulation
information is carried by these second-order coefficients. For $T = 370$ms,
$S_3 x$ carries only 3% of the signal energy. It increases as $T$ increases, but
for audio classification applications studied in this paper, $T$ remains below
370ms, so these third-order coefficients have a negligible role. We therefore
concentrate on second-order scattering representations:

$$Sx = \Big( S_0 x(t)\,,\; S_1 x(t,\lambda_1)\,,\; S_2 x(t,\lambda_1,\lambda_2) \Big)_{t,\lambda_1,\lambda_2} . \qquad (27)$$



4.3 Fast Scattering Computation 

Subsampling scattering vectors provides a reduced representation, which leads
to a faster implementation. Since the averaging window $\phi$ has a duration
$T$, we compute scattering vectors at $t = kT/2$ for every integer $k$.

We suppose that $x(t)$ has $N$ samples over each frame of duration $T$, and
is thus sampled at a rate $N/T$. For each time frame $t = kT/2$, the number
of first-order filters is about $Q_1 \log_2 N$ so there are about
$Q_1 \log_2 N$ first-order coefficients $S_1 x(t,\lambda_1)$. We now show that
the number of non-negligible second-order coefficients
$S_2 x(t,\lambda_1,\lambda_2)$ is about $Q_1 Q_2 (\log_2 N)^2 / 2$.

The wavelet transform envelope $|x \star \psi_{\lambda_1}(t)|$ is a demodulated
signal having approximately the same frequency bandwidth as
$\hat\psi_{\lambda_1}$. Its Fourier transform is mostly supported in the
interval $[-\lambda_1 Q_1^{-1}, \lambda_1 Q_1^{-1}]$ for
$\lambda_1 \ge 2\pi Q_1/T$, and in $[-2\pi T^{-1}, 2\pi T^{-1}]$ for
$\lambda_1 \le 2\pi Q_1/T$. If the support of $\hat\psi_{\lambda_2}$ centered at
$\lambda_2$ does not intersect the frequency support of
$|x \star \psi_{\lambda_1}|$, then

$$\big|\, |x \star \psi_{\lambda_1}| \star \psi_{\lambda_2} \big| \approx 0 .$$

One can verify that non-negligible second-order coefficients satisfy

$$\lambda_2 \le \max(\lambda_1 Q_1^{-1}\,,\; 2\pi T^{-1}) . \qquad (28)$$

For a fixed $t$, a direct calculation then shows that there are of the order of
$Q_1 Q_2 (\log_2 N)^2 / 2$ second-order scattering coefficients. Similar
reasoning extends this result to show that there are about
$Q_1 \cdots Q_m (\log_2 N)^m / m!$ non-negligible $m$th-order scattering
coefficients.
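As a quick arithmetic check of these orders of magnitude, the sketch below evaluates $Q_1 \cdots Q_m (\log_2 N)^m / m!$ for a frame of $N = 1024$ samples. The function name `scattering_count` and the choice of $N$ are illustrative assumptions; the formula only gives approximate counts.

```python
from math import factorial, log2

def scattering_count(N, Qs):
    """Approximate number of non-negligible coefficients of order m = len(Qs)."""
    m = len(Qs)
    product = 1
    for Q in Qs:
        product *= Q
    return product * log2(N) ** m / factorial(m)

first_order = scattering_count(1024, [8])        # Q1 = 8
second_order = scattering_count(1024, [8, 1])    # Q1 = 8, Q2 = 1
print(first_order, second_order)                 # 80.0 400.0
```

With $Q_1 = 8$ and $Q_2 = 1$, a frame of 1024 samples is thus summarized by roughly 80 first-order and 400 second-order coefficients.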

To compute $S_1 x$ and $S_2 x$ we first calculate $U_1 x$ and $U_2 x$ and
average them with $\phi$. Over a time frame of duration $T$, to reduce
computations while avoiding aliasing, $|x \star \psi_{\lambda_1}(t)|$ is
subsampled at a rate which is twice its bandwidth. The family of filters
$\{\hat\psi_{\lambda_1}\}_{\lambda_1 \in \Lambda_1}$ covers the whole frequency
domain and $\Lambda_1$ is chosen so that filter supports barely overlap. Over a
time frame where $x$ has $N$ samples,
$\{|x \star \psi_{\lambda_1}(t)|\}_{\lambda_1 \in \Lambda_1}$ then has about $2N$
samples. Similarly, $\big|\, |x \star \psi_{\lambda_1}| \star \psi_{\lambda_2}(t) \big|$
is subsampled in time at a rate twice its bandwidth. The total number of samples
for all $\lambda_1$ and $\lambda_2$ stays about $2N$. With an FFT, all first-
and second-order wavelet modulus coefficients and their time averages
$S_1 x(t,\lambda_1)$ and $S_2 x(t,\lambda_1,\lambda_2)$ are calculated for
$t = kT/2$ with $O(N \log N)$ operations.

5 Inverse Scattering 

To better understand the information carried by scattering coefficients, this
section studies a numerical inversion of the transform. Since a scattering
transform is computed by cascading wavelet modulus operators $|W_m|$, the
inversion approximately inverts each $|W_m|$ for $m \le \bar m$. At the maximum
depth $m = \bar m$, the algorithm begins with a deconvolution, estimating
$U_{\bar m} x(t)$ at all $t$ on the sampling grid of $x(t)$, from
$S_{\bar m} x(kT/2) = U_{\bar m} x \star \phi(kT/2)$.

Because of the subsampling, one cannot compute $U_{\bar m} x$ from
$S_{\bar m} x$ exactly. This deconvolution is thus the main source of error. To
take advantage of the fact that $U_{\bar m} x \ge 0$, the deconvolution is
computed with the Richardson-Lucy algorithm [29], which preserves positivity if
$\phi \ge 0$. We initialize $y_0(t)$ by interpolating $S_{\bar m} x(kT/2)$
linearly on the sampling grid of $x$, which introduces error because of
aliasing. The Richardson-Lucy deconvolution iteratively computes



$$y_{n+1}(t) = y_n(t) \cdot \left[ \left( \frac{y_0}{y_n \star \phi} \right) \star \tilde\phi\,(t) \right] , \qquad (29)$$

with $\tilde\phi(t) = \phi(-t)$. It converges to the pseudo-inverse of the
convolution operator applied to $y_0$, which blows up when $n$ increases because
of the deconvolution instability. Deconvolution algorithms thus stop the
iterations after a fixed number of iterations, which is set to 30 in this
application.
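The iteration (29) can be sketched directly. This is a minimal illustration, not the paper's implementation: it assumes a finite non-negative kernel, zero-padded "same"-size convolutions, and a small floor on the denominator to avoid division by zero. The function name `richardson_lucy` and the toy signal are hypothetical.

```python
import numpy as np

def richardson_lucy(y0, phi, n_iter=30):
    """Iterate y <- y * ((y0 / (y * phi)) * phi_flip), starting from y = y0."""
    phi = phi / phi.sum()                  # unit-mass averaging kernel
    phi_flip = phi[::-1]                   # phi(-t)
    y = y0.copy()
    for _ in range(n_iter):
        blurred = np.convolve(y, phi, mode="same")
        ratio = y0 / np.maximum(blurred, 1e-12)     # floor is an added safeguard
        y = y * np.convolve(ratio, phi_flip, mode="same")
    return y

# Toy example: two spikes blurred by a Gaussian window.
u = np.zeros(256)
u[100], u[140] = 1.0, 0.5
k = np.arange(-20, 21)
phi = np.exp(-k**2 / (2 * 5.0**2))
phi_n = phi / phi.sum()
y0 = np.convolve(u, phi_n, mode="same")
y = richardson_lucy(y0, phi, n_iter=30)
```

Every factor in the update is non-negative when $\phi \ge 0$ and $y_0 \ge 0$, which is why positivity is preserved. Stopping after a fixed number of iterations acts as the regularization described above.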
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Once an estimation of UmX is ralculated by deconvolution, we compute an 
esti mate x of x by inverting each W m for m > m > 0. As explained in Section 
|3.l[ Wx — [x * (j>, \x * tpx\)xeA can be inverted by taking advantage of the fact 
that wavelet coefficients define a complete and redundant signal representation. 
If | a; -kip\(t)\ = then no phase needs to be recovered. The inversion of W 
is thus more stable when Wx is sparse, which motivates using wavelets ip\ m 
providing a sparse representation at each order m. The inversion of W amounts 
to solving a non-convex optimization problem. Recent convex relaxation ap- 
proaches [7l[44] are able to compute exact solutions, but they require too much 
computation and memory for audio applications. Since the main source of er- 
rors is introduced at the deconvolution stage, one can use an approximate but 
fast inversion algorithm. 
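
The Richardson-Lucy iteration (29) used at the deconvolution stage can be sketched
in a few lines. This is only an illustration under stated assumptions, not the
implementation used for the experiments: convolutions are circular and computed
with the FFT, the filter `phi` is sampled on the same grid as the signal, is
centred at index 0 and sums to one, and the small constant `eps` (an addition of
this sketch) guards against divisions by zero. The function name
`richardson_lucy` is hypothetical.

```python
import numpy as np

def richardson_lucy(s, phi, n_iter=30, eps=1e-12):
    """Estimate a non-negative y such that y * phi is close to s.

    s   : non-negative samples already interpolated on the fine grid (y_0)
    phi : non-negative low-pass filter, same length as s, summing to one
    """
    phi_hat = np.fft.fft(phi)

    def conv(u, h_hat):
        # Circular convolution computed in the Fourier domain.
        return np.real(np.fft.ifft(np.fft.fft(u) * h_hat))

    y0 = np.asarray(s, dtype=float)
    y = y0.copy()
    for _ in range(n_iter):
        ratio = y0 / np.maximum(conv(y, phi_hat), eps)
        # Convolving with phi(-t) multiplies the spectrum by conj(phi_hat).
        y = y * conv(ratio, np.conj(phi_hat))
    return y
```

With a non-negative filter and a non-negative initialization every iterate stays
non-negative, and since `phi` sums to one the total mass of $y_0$ is preserved.
The number of iterations acts as the regularization parameter, in line with the
fixed stopping rule described above.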

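Griffin & Lim [19] showed that an alternating projection algorithm can recover
good quality audio signals from their spectrograms, but with large mean-square
errors because the algorithm is trapped in local minima. To compute an estimation
$\widetilde{x}$ of $x$ from $Wx$, this algorithm initializes $\widetilde{x}_0$ to
be a Gaussian white noise. For any $n \ge 0$, $\widetilde{x}_{n+1}$ is calculated
from $\widetilde{x}_n$ by adjusting the modulus of its wavelet coefficients

$$ z_\lambda(t) = |x \star \psi_\lambda(t)| \, \frac{\widetilde{x}_n \star \psi_\lambda(t)}{|\widetilde{x}_n \star \psi_\lambda(t)|} \qquad (30) $$

and by applying the wavelet transform pseudo-inverse (17)

$$ \widetilde{x}_{n+1} = x \star \phi \star \overline{\phi}(t) + \sum_{\lambda \in \Lambda} \mathrm{Real}\left( z_\lambda \star \overline{\psi}_\lambda(t) \right) . \qquad (31) $$

The dual filters $\overline{\phi}$ and $\overline{\psi}_\lambda$ are defined in
(18). Numerical experiments are performed with $n = 32$ iterations, and we set
$\widetilde{x} = \widetilde{x}_n$.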
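
The alternating projection described above can be illustrated with a small NumPy
sketch. Several choices here are assumptions made for the sake of a
self-contained example and are not taken from the paper: the filter bank is a
plain Gabor bank built in the Fourier domain (`gabor_bank`, with hypothetical
parameters `Q`, `J`, `xi_max`), convolutions are circular, and the dual filters
are obtained by dividing the conjugate filters by the Littlewood-Paley sum, which
gives the pseudo-inverse for real signals.

```python
import numpy as np

def gabor_bank(N, Q=4, J=24, xi_max=0.4):
    """Return a Gaussian low-pass phi_hat (N,) and analytic Gabor filters psi_hat (J, N)."""
    w = np.fft.fftfreq(N)                        # frequencies in cycles per sample
    w[N // 2] = abs(w[N // 2])                   # treat the Nyquist bin as a positive frequency
    xi = xi_max * 2.0 ** (-np.arange(J) / Q)     # geometrically spaced centre frequencies
    sigma = xi / (2.0 * Q)                       # bandwidth proportional to xi / Q
    psi_hat = np.exp(-0.5 * ((w[None, :] - xi[:, None]) / sigma[:, None]) ** 2)
    phi_hat = np.exp(-0.5 * (w / xi[-1]) ** 2)
    return phi_hat, psi_hat

def wavelet_modulus(x, phi_hat, psi_hat):
    """Return (x * phi, |x * psi_lambda|) using circular convolutions."""
    x_hat = np.fft.fft(x)
    low = np.real(np.fft.ifft(x_hat * phi_hat))
    return low, np.abs(np.fft.ifft(x_hat[None, :] * psi_hat, axis=1))

def griffin_lim(low, mod, phi_hat, psi_hat, n_iter=32, seed=0):
    """Alternate between imposing the moduli `mod` and applying the pseudo-inverse."""
    N = low.size
    psi_sq = np.abs(psi_hat) ** 2
    flipped = np.roll(psi_sq[:, ::-1], 1, axis=1)            # |psi_hat(-w)|^2
    lp = phi_hat ** 2 + 0.5 * (psi_sq + flipped).sum(axis=0)  # Littlewood-Paley sum
    phi_dual = phi_hat / lp
    psi_dual = np.conj(psi_hat) / lp[None, :]
    low_part = np.real(np.fft.ifft(np.fft.fft(low) * phi_dual))
    x = np.random.default_rng(seed).standard_normal(N)       # Gaussian white noise start
    for _ in range(n_iter):
        coef = np.fft.ifft(np.fft.fft(x)[None, :] * psi_hat, axis=1)
        z = mod * coef / np.maximum(np.abs(coef), 1e-12)     # keep the phase, impose the modulus
        back = np.fft.ifft(np.fft.fft(z, axis=1) * psi_dual, axis=1)
        x = low_part + np.real(back).sum(axis=0)
    return x
```

Calling `griffin_lim(*wavelet_modulus(x, phi_hat, psi_hat), phi_hat, psi_hat)`
returns a signal whose wavelet moduli approach those of `x`. The modulus mismatch
does not increase from one iteration to the next, but the iteration may stall in a
local minimum, which is the behaviour reported for the original algorithm.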

When the maximum order is $\overline{m} = 1$, an approximation $\widetilde{x}$ of
$x$ is computed from $(S_0 x, S_1 x)$ by first estimating $U_1 x$ from
$S_1 x = U_1 x \star \phi$ with the Richardson-Lucy deconvolution algorithm. We
then compute $\widetilde{x}$ from $S_0 x$ and this estimation of $U_1 x$ by
approximately inverting $W_1$ with the Griffin & Lim algorithm. When $T$ is above
100ms, the deconvolution loses too much information, and crude audio
reconstructions are obtained from first-order coefficients. Figure 5(a,b) shows
the scalograms $\log |x \star \psi_{\lambda_1}(t)|$ of a speech and a music signal,
and the scalograms $\log |\widetilde{x} \star \psi_{\lambda_1}(t)|$ of their
approximations $\widetilde{x}$ from first-order scattering coefficients.

When the maximum order is $\overline{m} = 2$, the approximation $\widetilde{x}$ is
calculated from $(S_0 x, S_1 x, S_2 x)$ by applying the deconvolution algorithm to
$S_2 x = U_2 x \star \phi$ to estimate $U_2 x$, and then by successively inverting
$W_2$ and $W_1$ with the Griffin & Lim algorithm. Figure 5(c) shows
$\log |\widetilde{x} \star \psi_{\lambda_1}(t)|$ for the same speech and music
signals. Amplitude modulations, vibratos and attacks are restored with greater
precision by incorporating second-order coefficients, yielding a much better audio
quality compared to first-order reconstructions. However, even with
$\overline{m} = 2$, reconstructions become crude for $T > 500$ms. Indeed, the
number of second-order scattering coefficients $Q_1 Q_2 \log_2^2 N / 2$ is too
small relative to the number $N$ of audio samples in each audio frame, and they do
not capture enough information. Examples of audio reconstructions are available at
http://www.di.ens.fr/signal/scattering/audio/

Figure 5: (a): Scalogram $\log |x \star \psi_{\lambda_1}(t)|$ for recordings of
speech (top) and a cello (bottom). (b,c): Scalograms
$\log |\widetilde{x} \star \psi_{\lambda_1}(t)|$ of reconstructions $\widetilde{x}$
from first-order scattering coefficients ($\overline{m} = 1$) in (b), and from
first- and second-order coefficients ($\overline{m} = 2$) in (c). Scattering
coefficients were computed with $T = 190$ms for the speech signal and $T = 370$ms
for the cello signal.

6 Scattering Spectrum 

Whereas first-order scattering coefficients provide average spectrum measurements,
which are equivalent to the mel-frequency spectrogram, second-order coefficients
contain important complementary information. Section 6.1 shows that they provide
better spectral resolution through interference measurements within each mel-scale
interval. Section 6.2 proves that
$S_2 x(t, \lambda_1, \lambda_2) / S_1 x(t, \lambda_1)$ characterizes the modulation
spectrum of audio signals. As with MFCCs, computing the logarithm of scattering
coefficients linearly separates all multiplicative components.



6.1 Frequency Interval Measurement from Interference 

A wavelet transform has a worse frequency resolution than a windowed Fourier
transform at high frequencies. However, we show that frequency intervals between
harmonics are accurately measured by second-order scattering coefficients.

To simplify explanations, we consider a signal $x$ of period $T_0$. Squared
wavelet coefficients can be written

$$ |x \star \psi_{\lambda_1}(t)|^2 = e^2 + \epsilon(t) , \qquad (32) $$

where $e^2$ is the filtered signal energy

$$ e^2 = \frac{1}{T_0} \int_0^{T_0} |x \star \psi_{\lambda_1}(t)|^2 \, dt $$

and $\epsilon(t)$ is an oscillatory interference term giving the correlation
between the frequency components in the support of $\hat\psi_{\lambda_1}$. If
$\epsilon \ll e^2$ then a first-order approximation of the square root applied to
(32) gives

$$ |x \star \psi_{\lambda_1}(t)| \approx e + \frac{\epsilon(t)}{2e} . $$

If we assume that the size $T$ of the window $\phi$ satisfies $T \gg T_0$ then
$\epsilon \star \phi \approx 0$ because $\epsilon$ contains only frequencies larger
than $2\pi/T_0$, which are thus outside the frequency support $[-\pi/T, \pi/T]$ of
$\hat\phi$. It results that

$$ S_1 x(t, \lambda_1) = |x \star \psi_{\lambda_1}| \star \phi(t) \approx e , \qquad (33) $$

and $S_2 x(t, \lambda_1, \lambda_2) = ||x \star \psi_{\lambda_1}| \star \psi_{\lambda_2}| \star \phi(t)$
satisfies

$$ S_2 x(t, \lambda_1, \lambda_2) \approx \frac{1}{2e} \, |\epsilon \star \psi_{\lambda_2}| \star \phi(t) . \qquad (34) $$

For example, if $x$ has two frequency components in the support of
$\hat\psi_{\lambda_1}$, we have

$$ x \star \psi_{\lambda_1}(t) = \alpha_1 e^{i \xi_1 t} + \alpha_2 e^{i \xi_2 t} , $$

and (32) then implies that $\epsilon(t) = 2 \alpha_1 \alpha_2 \cos((\xi_1 - \xi_2) t)$.
Hence

$$ \frac{S_2 x(t, \lambda_1, \lambda_2)}{S_1 x(t, \lambda_1)} \approx |\hat\psi_{\lambda_2}(\xi_2 - \xi_1)| \, \frac{\alpha_1 \alpha_2}{|\alpha_1|^2 + |\alpha_2|^2} . $$

These normalized second-order coefficients are thus non-negligible when
$\lambda_2$ is of the order of the distance $|\xi_2 - \xi_1|$ between the two
harmonics. This shows that although the first wavelet $\psi_{\lambda_1}$ does not
have enough resolution to discriminate the frequencies $\xi_1$ and $\xi_2$,
second-order coefficients detect their presence and accurately measure the
interval $|\xi_2 - \xi_1|$. As in audio perception, scattering coefficients can
accurately measure frequency intervals but not frequency location.
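
The two-component case can be checked numerically. The sketch below is an
illustration only: it works directly on the output of a hypothetical first-order
wavelet containing two partials, uses frequencies in Hz rather than angular
frequencies, and assumes a sampling rate `fs` and amplitudes `a1`, `a2` that are
not taken from the paper. It verifies that the squared modulus splits into a
constant energy and a beat at the difference frequency, and that the first-order
square-root approximation is accurate when the interference term is small.

```python
import numpy as np

fs = 8000.0                          # sampling rate in Hz (assumed)
t = np.arange(int(fs)) / fs          # one second of samples
xi1, xi2 = 600.0, 675.0              # the two partials, in Hz
a1, a2 = 1.0, 0.1                    # amplitudes with 2*a1*a2 much smaller than a1**2 + a2**2

# Output of a first-order wavelet whose band contains both partials.
band = a1 * np.exp(2j * np.pi * xi1 * t) + a2 * np.exp(2j * np.pi * xi2 * t)
squared = np.abs(band) ** 2
e2 = squared.mean()                  # filtered energy e^2 of (32)
eps = squared - e2                   # interference term epsilon(t)

# The interference term oscillates at the interval between the partials.
freqs = np.fft.rfftfreq(t.size, d=1.0 / fs)
beat = freqs[np.argmax(np.abs(np.fft.rfft(eps)))]

# First-order approximation of the modulus: e + epsilon / (2 e).
e = np.sqrt(e2)
approx_error = np.max(np.abs(np.abs(band) - (e + eps / (2 * e))))
print(beat, e2, approx_error)
```

The beat frequency found this way is the interval $|\xi_2 - \xi_1|$, which is the
quantity a second-order wavelet $\psi_{\lambda_2}$ responds to.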

If $x \star \psi_{\lambda_1}(t) = \sum_n \alpha_n e^{i \xi_n t}$ has more frequency
components, we verify similarly that
$S_2 x(t, \lambda_1, \lambda_2)/S_1 x(t, \lambda_1)$ is non-negligible when
$\lambda_2$ is of the order of $|\xi_n - \xi_{n'}|$ for some $n \ne n'$. These
coefficients can thus measure multiple frequency intervals within the frequency
band covered by $\hat\psi_{\lambda_1}$. If the frequency resolution of
$\psi_{\lambda_2}$ is not sufficient to discriminate between two frequency
intervals $|\xi_1 - \xi_2|$ and $|\xi_3 - \xi_4|$, these intervals will interfere
and create high amplitude third-order scattering coefficients. A similar
calculation shows that third-order scattering coefficients
$S_3 x(t, \lambda_1, \lambda_2, \lambda_3)$ detect the presence of two such
intervals within the support of $\hat\psi_{\lambda_2}$ when $\lambda_3$ is close to
$\big| |\xi_1 - \xi_2| - |\xi_3 - \xi_4| \big|$. They thus measure "intervals of
intervals."

Figure 6(a) shows the scalogram $\log |x \star \psi_{\lambda_1}|$ of a signal $x$
containing a chord with two notes, whose fundamental frequencies are
$\xi_1 = 600$Hz and $\xi_2 = 675$Hz, followed by an arpeggio of the same two notes.
First-order coefficients $\log S_1 x(t, \lambda_1)$ in Figure 6(b) are very similar
for the chord and the arpeggio because the time averaging loses time localization.
However, they are easily differentiated in Figure 6(c), which displays
$\log(S_2 x(t, \lambda_1, \lambda_2)/S_1 x(t, \lambda_1))$ for
$\lambda_1 \approx \xi_1 = 600$Hz, as a function of $\lambda_2$. The chord creates
large amplitude coefficients for $\lambda_2 = |\xi_2 - \xi_1| = 75$Hz, which
disappear for the arpeggio because these two frequencies are not present
simultaneously. Second-order coefficients also have a large amplitude at low
frequencies $\lambda_2$. These arise from variation of the note envelopes in the
chord and in the arpeggio, as explained in the next section.

6.2 Amplitude Modulation Spectrum 

Audio signals are usually modulated in amplitude by an envelope, whose variations
may correspond to an attack or a tremolo. For voiced and unvoiced sounds modeled
by harmonic sounds and Gaussian noises, we show that amplitude modulations are
characterized by second-order scattering coefficients.

Let $x(t)$ be a sound resulting from an excitation $e(t)$ filtered by a resonance
cavity of impulse response $h(t)$, which is modulated in amplitude by
$a(t) \ge 0$ to give

$$ x(t) = a(t) \, (e \star h)(t) . \qquad (35) $$

The impulse response $h$ is typically very short compared to the minimum variation
interval $(\sup_t |a'(t)|)^{-1}$ of the modulation term. Observe that if
$\lambda_1 \ge 2\pi Q_1 / T$ satisfies

$$ \left( \int |t| \, |h(t)| \, dt \right)^{-1} \gg \frac{\lambda_1}{Q_1} \gg \sup_t |a'(t)| , \qquad (36) $$

then $a(t)$ remains nearly constant over the time support of $\psi_{\lambda_1}$ and
$\hat h(\omega)$ is nearly constant over the frequency support of
$\hat\psi_{\lambda_1}$. It results that

$$ |x \star \psi_{\lambda_1}(t)| \approx |\hat h(\lambda_1)| \, |e \star \psi_{\lambda_1}(t)| \, a(t) . \qquad (37) $$

We shall compute $|e \star \psi_{\lambda_1}|$ when $e(t)$ is a pulse train or a
Gaussian white noise and derive the values of first- and second-order scattering
coefficients.

For a voiced sound, the excitation is modeled by a pulse train of pitch $\xi$:

$$ e(t) = \frac{2\pi}{\xi} \sum_n \delta\left( t - \frac{2 n \pi}{\xi} \right) = \sum_k e^{i k \xi t} . $$

Figure 6: (a): Scalogram $\log |x \star \psi_{\lambda_1}(t)|$ for a signal with two
notes, of fundamental frequencies $\xi_1 = 600$Hz and $\xi_2 = 675$Hz, first played
as a chord and then as an arpeggio. (b): First-order scattering coefficients
$\log S_1 x(t, \lambda_1)$ for $T = 512$ms. (c): Second-order scattering
coefficients $\log(S_2 x(t, \lambda_1, \lambda_2)/S_1 x(t, \lambda_1))$ with
$\lambda_1 = \xi_1$ as a function of $t$ and $\lambda_2$. The chord interferences
produce large coefficients for $\lambda_2 = |\xi_2 - \xi_1|$.

Suppose that $\lambda_1 / Q_1 \ll \xi$ so that the support of
$\hat\psi_{\lambda_1}$ covers at most one partial, whose frequency $k\xi$ is the
closest to $\lambda_1$. It then results from (37) that

$$ |x \star \psi_{\lambda_1}(t)| \approx |\hat h(\lambda_1)| \, |\hat\psi_{\lambda_1}(k\xi)| \, a(t) , \qquad (38) $$

so $S_1 x(t, \lambda_1) = |x \star \psi_{\lambda_1}| \star \phi(t)$ is given by

$$ S_1 x(t, \lambda_1) \approx |\hat h(\lambda_1)| \, |\hat\psi_{\lambda_1}(k\xi)| \, a \star \phi(t) . \qquad (39) $$

It is non-zero when $\lambda_1$ is close to a harmonic $k\xi$, and is proportional
to $|\hat h(\lambda_1)|$. Figure 7(a) displays $\log |x \star \psi_{\lambda_1}(t)|$
for a signal having three voiced and three unvoiced sounds. The first three are
produced by a pulse train excitation $e(t)$ with a pitch of $\xi = 600$Hz. Figure
7(b) shows that $\log S_1 x(t, \lambda_1)$ has a harmonic structure, with an
amplitude depending on $\log |\hat h(\lambda_1)|$. However, the averaging by $\phi$
removes the differences between the modulation amplitudes $a(t)$ of these three
voiced sounds.



Using (38), we compute second-order scattering coefficients

$$ S_2 x(t, \lambda_1, \lambda_2) \approx |\hat h(\lambda_1)| \, |\hat\psi_{\lambda_1}(k\xi)| \, |a \star \psi_{\lambda_2}| \star \phi(t) , $$

which implies that

$$ \frac{S_2 x(t, \lambda_1, \lambda_2)}{S_1 x(t, \lambda_1)} \approx \frac{|a \star \psi_{\lambda_2}| \star \phi(t)}{a \star \phi(t)} . \qquad (40) $$

Normalized second-order scattering coefficients thus depend only on the amplitude
modulation $a(t)$, and compute its wavelet spectrum at all frequencies
$\lambda_2$.

Figure 7: (a): $\log |x \star \psi_{\lambda_1}(t)|$ for a signal with three voiced
sounds of same pitch $\xi = 600$Hz and same $h(t)$ but different amplitude
modulations $a(t)$: first a smooth attack, then a sharp attack, then a tremolo of
frequency $\eta$. It is followed by three unvoiced sounds created with the same
$h(t)$ and same amplitude modulations $a(t)$ as the first three voiced sounds.
(b): First-order scattering $\log S_1 x(t, \lambda_1)$ with $T = 128$ms. (c):
Second-order scattering $\log(S_2 x(t, \lambda_1, \lambda_2)/S_1 x(t, \lambda_1))$
displayed for $\lambda_1 = 4\xi$, as a function of $t$ and $\lambda_2$.

Figure 7(c) displays $\log(S_2 x(t, \lambda_1, \lambda_2)/S_1 x(t, \lambda_1))$ for
the fourth partial $\lambda_1 = 4\xi$, as a function of $\lambda_2$. The modulation
envelope $a(t)$ of the first sound has a smooth attack and thus produces large
coefficients only at low frequencies $\lambda_2$. The envelope $a(t)$ of the second
sound has a much sharper attack and thus produces large amplitude coefficients for
higher frequencies $\lambda_2$. The third sound is modulated by a tremolo, which is
a periodic oscillation $a(t) = 1 + \epsilon \cos(\eta t)$. According to (40), this
tremolo creates large amplitude coefficients when $\lambda_2 = \eta$, as shown in
Figure 7(c).
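
The tremolo case of (40) lends itself to a quick numerical illustration. The
sketch below evaluates the right-hand side of (40) for an envelope
$a(t) = 1 + \epsilon \cos(\eta t)$ and a handful of second-order centre
frequencies. Everything specific in it is an assumption of the example: the
sampling rate, the tremolo rate of 8Hz, the Gaussian averaging window, and the use
of analytic Gabor filters with a bandwidth of a quarter of their centre frequency
in place of the Morlet wavelets of the paper. The function name
`normalized_second_order` is hypothetical.

```python
import numpy as np

def normalized_second_order(a, fs, lambdas, T=0.5):
    """Return |a * psi_lambda2| * phi / (a * phi) for each lambda2 (in Hz)."""
    w = np.fft.fftfreq(a.size, d=1.0 / fs)          # frequency axis in Hz
    phi_hat = np.exp(-0.5 * (w * T) ** 2)           # low-pass window of duration about T
    a_hat = np.fft.fft(a)
    denom = np.real(np.fft.ifft(a_hat * phi_hat))   # a * phi
    rows = []
    for lam in lambdas:
        psi_hat = np.exp(-0.5 * ((w - lam) / (lam / 4.0)) ** 2)   # analytic Gabor filter
        envelope = np.abs(np.fft.ifft(a_hat * psi_hat))           # |a * psi_lambda2|
        smooth = np.real(np.fft.ifft(np.fft.fft(envelope) * phi_hat))
        rows.append(smooth / denom)
    return np.array(rows)

fs, eta, depth = 1000.0, 8.0, 0.3                   # tremolo rate eta in Hz, depth epsilon
t = np.arange(int(4 * fs)) / fs
a = 1.0 + depth * np.cos(2 * np.pi * eta * t)
lambdas = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
ratio = normalized_second_order(a, fs, lambdas).mean(axis=1)
best = lambdas[np.argmax(ratio)]
print(dict(zip(lambdas, np.round(ratio, 4))), best)
```

The ratio is largest for the filter centred on the tremolo rate, where it is close
to half the modulation depth, and it is small for the other centre frequencies.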

Unvoiced sounds are modeled by excitations $e(t)$ which are realizations of
Gaussian white noise. The modulation amplitude is typically non-sparse, which
means the square of the average of $a(t)$ on intervals of size $T$ is of the order
of the average of $a^2(t)$. If $\lambda_1 Q_1^{-1} \gg T^{-1}$ then Appendix A
shows

$$ S_1 x(t, \lambda_1) \approx \frac{\pi^{1/2}}{2} \, \|\psi\| \, \lambda_1^{1/2} \, |\hat h(\lambda_1)| \, a \star \phi(t) . \qquad (41) $$

First-order scattering coefficients are again proportional to
$|\hat h(\lambda_1)|$ but do not have a harmonic structure. This is shown in Figure
7(b) by the last three unvoiced sounds. The fourth, fifth, and sixth sounds have
the same filter $h(t)$ and envelope $a(t)$ as the first, second, and third sounds,
respectively, but with a Gaussian white noise excitation.

Appendix A also shows that if $a(t)$ is non-sparse and
$\lambda_1 Q_1^{-1} \gg T^{-1}$ then

$$ \frac{S_2 x(t, \lambda_1, \lambda_2)}{S_1 x(t, \lambda_1)} = \frac{|a \star \psi_{\lambda_2}| \star \phi(t)}{a \star \phi(t)} + \widetilde{\epsilon}(t) , $$

where $\widetilde{\epsilon}(t)$ is small relative to the first amplitude modulation
term if $(4/\pi - 1)^{1/2} (\lambda_2 Q_1)^{1/2} (\lambda_1 Q_2)^{-1/2}$ is small
relative to this modulation term. Voiced and unvoiced sounds thus produce similar
second-order scattering coefficients. This is illustrated by Figure 7(c), which
shows that the fourth, fifth, and sixth sounds have second-order coefficients
similar to those of the first, second, and third sounds, respectively. The
stochastic error term $\widetilde{\epsilon}$ produced by unvoiced sounds appears as
small amplitude random fluctuations in Figure 7(c).

7 Frequency Transposition Invariance 

Audio signals within the same class may be transposed in frequency, as when the 
same word is pronounced by a man or a woman. This frequency transposition is 
a complex phenomenon which affects the pitch and filter differently. The pitch is 
typically translated on a logarithmic frequency scale whereas filters are not just 
translated but also deformed. We thus need a representation which is invariant 
to frequency translation on a logarithmic scale, but also stable to frequency 
deformation. After reviewing the Mel-frequency cepstral coefficient (MFCC) 
approach through the discrete cosine transform (DCT), this section defines such 
a representation with a scattering transform computed along frequency. 

MFCCs are computed from the mel-frequency spectrogram $\log Mx(t, \lambda)$ by
calculating a DCT along the mel-frequency index $\gamma$ for a fixed $t$ [16]. This
$\gamma$ is linear in $\lambda$ for low frequencies, but is proportional to
$\log_2 \lambda$ for higher frequencies. For simplicity, we write
$\gamma = \log_2 \lambda$ and $\lambda = 2^\gamma$, although this should be
modified at low frequencies.

The frequency index of the DCT is called the "quefrency" parameter. Setting
high-quefrency coefficients to zero is equivalent to averaging
$\log Mx(t, 2^\gamma)$ along $\gamma$, which provides some frequency transposition
invariance. The more high-quefrency coefficients are set to zero, the bigger the
averaging and hence the more transposition invariance obtained, but at the expense
of losing information.
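
The effect of discarding high quefrencies can be seen on a toy example. The sketch
below is an illustration under assumptions of its own: it builds an orthonormal
DCT-II matrix by hand, uses a synthetic log-spectrum made of a single bump along
$\gamma$, and keeps 8 coefficients out of 40. The names `dct_matrix` and
`cepstral_smooth` are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix: row k is the cosine vector of quefrency index k."""
    k = np.arange(n)[:, None]
    g = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (g + 0.5) * k / n)
    C[0] /= np.sqrt(2.0)
    return C

def cepstral_smooth(log_mel, n_keep):
    """Keep the first n_keep DCT coefficients along gamma and resynthesize."""
    C = dct_matrix(log_mel.size)
    coeffs = C @ log_mel
    coeffs[n_keep:] = 0.0            # discard the high quefrencies
    return C.T @ coeffs

gamma = np.arange(40)
peak = np.exp(-0.5 * ((gamma - 20) / 1.5) ** 2)          # a spectral peak along gamma
transposed = np.exp(-0.5 * ((gamma - 22) / 1.5) ** 2)    # the same peak, transposed

before = np.linalg.norm(peak - transposed)
after = np.linalg.norm(cepstral_smooth(peak, 8) - cepstral_smooth(transposed, 8))
print(before, after)
```

The distance between the two transposed spectra shrinks after truncation, which
is the invariance gain, while the smoothed peak no longer has its original width,
which is the information loss mentioned above.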

We avoid this loss by replacing the DCT with a scattering transform along
$\gamma$. A frequency scattering transform is calculated by iteratively applying
wavelet transforms and modulus operators. An analytic wavelet transform of a
log-frequency dependent signal $z(\gamma)$ is defined as in (13), but with
convolutions along the log-frequency variable $\gamma$ instead of time:

$$ W^{\mathrm{fr}} z = \left( z \star \phi^{\mathrm{fr}}(\gamma) , \; z \star \psi_q(\gamma) \right)_q . \qquad (42) $$

Each wavelet $\psi_q$ is a band-pass filter whose Fourier transform $\hat\psi_q$ is
centered at "quefrency" $q$ and $\phi^{\mathrm{fr}}$ is an averaging filter. These
wavelets satisfy the condition (15), so $W^{\mathrm{fr}}$ is contractive and
invertible.

A scattering transform computes a cascade of wavelet modulus transforms. Similarly
to (27), we iteratively compute wavelet modulus convolutions

$$ U^{\mathrm{fr}} z = \left( z(\gamma) , \; |z \star \psi_{q_1}(\gamma)| , \; ||z \star \psi_{q_1}| \star \psi_{q_2}(\gamma)| \right) . \qquad (43) $$

Averaging these defines a second-order scattering transform:

$$ S^{\mathrm{fr}} z = \left( z \star \phi^{\mathrm{fr}}(\gamma) , \; |z \star \psi_{q_1}| \star \phi^{\mathrm{fr}}(\gamma) , \; ||z \star \psi_{q_1}| \star \psi_{q_2}| \star \phi^{\mathrm{fr}}(\gamma) \right) . \qquad (44) $$

These coefficients are locally invariant to log-frequency shifts, over a domain
proportional to the support of the averaging filter $\phi^{\mathrm{fr}}$. This
frequency scattering is formally identical to a time scattering transform, having
the same properties when exchanging time and log-frequency variables. Numerical
experiments are implemented with Morlet wavelets $\psi_{q_1}$ and $\psi_{q_2}$ with
$Q_1 = Q_2 = 1$.
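
A first-order version of (43) and (44) can be sketched as follows. This is an
illustration rather than the filter bank of the experiments: analytic Gabor
filters of roughly one octave bandwidth stand in for the Morlet wavelets,
convolutions along $\gamma$ are circular, and the centre quefrencies `centres` and
the averaging width `avg_width` are arbitrary values chosen for the example. The
function name `frequency_scattering` is hypothetical.

```python
import numpy as np

def frequency_scattering(z, centres=(0.25, 0.125, 0.0625), avg_width=0.03):
    """First-order scattering of z(gamma) along log-frequency (circular convolutions)."""
    q = np.fft.fftfreq(z.size)                       # quefrency axis
    phi_hat = np.exp(-0.5 * (q / avg_width) ** 2)    # averaging filter phi^fr
    z_hat = np.fft.fft(z)
    U = np.array([
        np.abs(np.fft.ifft(z_hat * np.exp(-0.5 * ((q - c) / (c / 3.0)) ** 2)))
        for c in centres
    ])                                               # |z * psi_q1| for each q1
    S0 = np.real(np.fft.ifft(z_hat * phi_hat))       # z * phi^fr
    S1 = np.real(np.fft.ifft(np.fft.fft(U, axis=1) * phi_hat, axis=1))
    return U, S0, S1

gamma = np.arange(128)
z = np.exp(-0.5 * ((gamma - 30) / 2.0) ** 2)         # a spectral peak along gamma
z_shift = np.roll(z, 5)                              # the same peak, transposed

U_a, _, S_a = frequency_scattering(z)
U_b, _, S_b = frequency_scattering(z_shift)
print(np.linalg.norm(U_b - U_a), np.linalg.norm(S_b - S_a))
```

The wavelet modulus coefficients `U` simply move with the transposition, whereas
the averaged coefficients `S1` change less. The size of the averaging filter sets
how much of the transposition is absorbed.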

The frequency scattering transform is not applied to MFCCs but to the logarithm of
normalized first- and second-order time scattering coefficients, as computed in
Section 3. For audio signals, $S_0 x = x \star \phi \approx 0$ so we neglect these
coefficients. We also normalize the scattering transform by dividing second-order
coefficients by their corresponding first-order coefficients. Abusing notation
slightly, we still refer to this normalized scattering transform as $Sx$. As
explained in Section 6, taking the logarithm separates signal components, such as
amplitude modulations. As with MFCCs, the frequency scattering is computed along
the log-frequency parameter $\gamma$. For a fixed frequency $\lambda_1 = 2^\gamma$
and time $t$, the second-order log-normalized time scattering vector is

$$ \log Sx(t, \gamma) = \begin{pmatrix} \log\left( |x \star \psi_{2^\gamma}| \star \phi(t) \right) \\[4pt] \log\left( \dfrac{||x \star \psi_{2^\gamma}| \star \psi_{\lambda_2}| \star \phi(t)}{|x \star \psi_{2^\gamma}| \star \phi(t)} \right) \end{pmatrix}_{\lambda_2} . \qquad (45) $$

This is a multidimensional vector of signals $z(\gamma)$, depending only on
$\gamma$ for a fixed $t$ and $\lambda_2$. Let us transform each $z(\gamma)$ by the
frequency scattering operators $U^{\mathrm{fr}}$ and $S^{\mathrm{fr}}$ defined in
(43) and (44). We let $U^{\mathrm{fr}} \log Sx(t, \gamma)$ and
$S^{\mathrm{fr}} \log Sx(t, \gamma)$ stand for the concatenation of these
transformed signals for all $t$ and $\lambda_2$. The representation
$S^{\mathrm{fr}} \log Sx$ thus cascades a scattering in time and in frequency, so
it is locally translation invariant in time and in log-frequency, as well as stable
to time and frequency deformations. The interval of time invariance is defined by
the size of the time averaging window $\phi$, whereas its frequency invariance
depends upon the width of the frequency averaging window $\phi^{\mathrm{fr}}$.

Figure 8: A time and frequency scattering representation is computed by applying
a normalized temporal scattering S on the input signal x(t), a logarithm, and a 
scattering along log-frequency without averaging. 
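
The normalization and logarithm steps of this pipeline can be written compactly.
The sketch below assumes that first- and second-order time scattering coefficients
are already available as arrays `S1` of shape (number of $\lambda_1$, number of
time frames) and `S2` of shape (number of $\lambda_1$, number of $\lambda_2$,
number of time frames); this array layout, the function name and the small
constant `eps` are assumptions of the example.

```python
import numpy as np

def log_normalized_scattering(S1, S2, eps=1e-12):
    """Stack log S1 and log(S2 / S1) into an array of shape (n_lambda1, 1 + n_lambda2, n_t)."""
    first = np.log(S1 + eps)[:, None, :]
    second = np.log((S2 + eps) / (S1[:, None, :] + eps))
    return np.concatenate([first, second], axis=1)

rng = np.random.default_rng(0)
S1 = rng.uniform(0.5, 2.0, size=(3, 7))        # placeholder first-order coefficients
S2 = rng.uniform(0.1, 1.0, size=(3, 4, 7))     # placeholder second-order coefficients
v = log_normalized_scattering(S1, S2)
w = log_normalized_scattering(5.0 * S1, 5.0 * S2)
print(v.shape)
```

Multiplying the signal by a constant shifts the first-order rows by the logarithm
of that constant and leaves the normalized second-order rows unchanged, which is
the separation of multiplicative components referred to above.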

For different tasks, frequency transposition invariance may help classification or
may destroy important information, the latter being the case for speaker
identification. The size of the frequency averaging filter $\phi^{\mathrm{fr}}$
should thus depend upon the task. The next section explains how this is learned at
the classification stage.



8 Classification 

This section compares the classification performance of support vector machine
classifiers applied to scattering representations with standard low-level features
such as Δ-MFCCs or state-of-the-art representations. Section 8.1 explains how to
automatically adapt invariance parameters, while Sections 8.2 and 8.3 present
results for musical genre classification and phone identification, respectively.



8.1 Adapting Time and Frequency Transposition Invariance

The amount of time shift and frequency transposition invariance depends on 
the classification problem, and may vary for each signal class. This adaptation 
is implemented by a supervised classifier, applied to the time and frequency 
scattering representation. 

Figure 8 illustrates the computation of a time and frequency scattering
representation. The scattering transform $Sx$ of an input signal $x$ is computed
along time, with averaging scale $T$, and sampled at time intervals $T/2$. The
transform is normalized and a logarithm is applied to separate multiplicative
factors. Scattering coefficients are indexed by their log-frequency parameter
$\gamma$, which defines the vector $\log Sx(t, \gamma)$ in (45). For a fixed $t$,
$U^{\mathrm{fr}} \log Sx(t, \gamma)$ is calculated as in (43), with wavelet
convolutions along the parameter $\gamma$. Averaging
$U^{\mathrm{fr}} \log Sx(t, \gamma)$ with $\phi^{\mathrm{fr}}(\gamma)$ computes a
frequency scattering transform $S^{\mathrm{fr}} \log Sx(t, \gamma)$, which is
locally invariant to frequency transposition.

Since we do not know in advance how much transposition invariance is needed for a
particular classification task, the final frequency averaging is adaptively
computed by the supervised classifier, which takes for input
$\{U^{\mathrm{fr}} \log Sx(t, \gamma)\}_\gamma$ for each time frame $t$. The
supervised classification is implemented by a support vector machine (SVM). A
binary SVM classifies a feature vector by calculating its position relative to a
hyperplane, which is optimized to maximize class separation given a set of
training samples. It thus computes the sign of an optimized linear combination of
the feature vector coefficients. With a Gaussian kernel of variance $\sigma^2$, the
SVM computes different hyperplanes in different balls of radius $\sigma$ in the
feature space. The coefficients of the linear combination thus vary smoothly with
the feature vector values. Applied to
$\{U^{\mathrm{fr}} \log Sx(t, \gamma)\}_\gamma$, the SVM optimizes the linear
combination of coefficients along $\gamma$, and can thus adjust the amount of
linear averaging to create frequency transposition invariant descriptors which
maximize class separation. A multi-class SVM is computed from binary classifiers
using a one-versus-one approach. All numerical experiments use the LIBSVM library
[8].

We can also use this approach to automatically adjust the wavelet octave
resolution $Q_1$. We compute the time scattering for several values of $Q_1$, and
concatenate the coefficients in a single feature vector. A filter bank with
$Q_1 = 8$ has enough frequency resolution to separate harmonic structures, whereas
wavelets with $Q_1 = 1$ have a smaller time support and can thus better localize
transients in time. By calculating linear combinations of feature vector
coefficients, the SVM can amplify the coefficients corresponding to a better
$Q_1$, depending upon the type of structure needed to best discriminate a given
class. In the experiments described below, adding more values of $Q_1$ between 1
and 8 provides marginal improvements.

Classification results can also be improved by adapting the averaging size 
T of the time scattering, for each signal class. For example, a phone duration 
may range from 10ms to 200ms and shorter phones are better discriminated 
with scattering coefficients calculated with smaller T. We thus concatenate 
scattering transforms computed for several T, letting the SVM amplify scatter- 
ing coefficients computed with a T that is best adapted to each class. In the 
experiments, this adaptivity is implemented with three values of T. 

8.2 Musical Genre Classification 

Scattering feature vectors are first applied to a musical genre classification
problem on the GTZAN dataset [43]. The dataset consists of 1000 thirty-second
clips, divided into 10 genres of 100 clips each. Given a clip, the goal is to find
its genre.

A set of feature vectors is computed over half-overlapping audio frames of
duration $T$. Each frame of a clip is classified separately by a Gaussian kernel
SVM, and the clip is assigned to the class which is most often selected by its
frames. To reduce the SVM training time, feature vectors were only computed every
370ms for the training set. The SVM slack parameter and the Gaussian kernel
variance are determined through cross-validation on the training data. Table 2
summarizes the results of one run of ten-fold cross-validation. It gives the
average error and its standard deviation.
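
The clip-level decision rule described above is a majority vote over frame-level
predictions. A minimal sketch, assuming the per-frame labels have already been
produced by the SVM (the function name `classify_clip` and the example labels are
placeholders, and ties are broken in favour of the label seen first, which is a
choice of this sketch):

```python
from collections import Counter

def classify_clip(frame_labels):
    """Assign a clip to the class most often predicted for its frames."""
    return Counter(frame_labels).most_common(1)[0][0]

# Hypothetical frame-level predictions for one clip.
frames = ["jazz", "rock", "jazz", "blues", "jazz"]
print(classify_clip(frames))
```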

A Δ-MFCC vector represents an audio frame of duration $T$ at time $t$ by three
MFCC vectors centered at $t - T/2$, $t$ and $t + T/2$. When computed for
$T = 23$ms, the Δ-MFCC error is 19.3%, which is reduced to 17.8% by increasing $T$
to 370ms. This is because the frames must be sufficiently long to include enough
musical structure. Further increasing $T$ does not reduce the error because the
underlying ergodic stationarity hypothesis is strongly violated.

Representations                            GTZAN             TIMIT
Δ-MFCC (T = 23ms)                          19.3 ± 4.2        19.3
Δ-MFCC (T = 370ms)                         17.8 ± 4.2        66.1
State of the art (excluding scattering)    9.4 ± 3.1 [26]    16.7 [9]

                                           T = 370ms         T = 32ms
Time Scat., m = 1                          17.9 ± 4.2        18.5
Time Scat., m = 2                          12.3 ± 2.7        17.7
Time Scat., m = 3                          10.7 ± 2.0        18.7
Time & Freq. Scat., m = 2                  10.3 ± 2.3        16.5
Adapt Q1, Time & Freq. Scat., m = 2        9.0 ± 2.0         16.1
Adapt Q1, T, Time & Freq. Scat., m = 2     8.1 ± 2.3         15.8

Table 2: Error rates (in percent) for musical genre classification on GTZAN and
for phone identification on the TIMIT database for different features. Time
scattering transforms are computed with T = 370ms for GTZAN and with T = 32ms for
TIMIT.

State-of-the-art algorithms provide refined feature vectors to improve
classification. For example, combining MFCCs with stabilized modulation spectra
and performing linear discriminant analysis, [26] obtains an error of 9.4%, the
best result so far. In [21], a deep belief network is trained on spectrograms,
achieving a 15.7% error with an SVM classifier. Finally, [22] uses a sparse
representation on a constant-Q transform, giving a 16.6% error using an SVM. A
wavelet scattering does not need any learning because the nature of time and
frequency invariants is known and leads to an optimized choice of wavelet filters.
It reduces computations and improves results relative to a learning approach.

Table 2 gives classification errors for different scattering feature vectors. For
a maximum order $\overline{m} = 1$, they are composed of first-order time
scattering coefficients computed using Gabor wavelets with $Q_1 = 8$ and
$T = 370$ms. These vectors are similar to MFCCs as shown by (11). As a result, the
classification error of 17.9% is close to that of MFCCs for the same $T$. For
$\overline{m} = 2$, we add second-order coefficients computed using Morlet
wavelets with $Q_2 = 2$. It reduces the error to 12.3%. This 30% error reduction
shows the importance of second-order coefficients for relatively large $T$.
Third-order coefficients are also computed with Morlet wavelets with $Q_3 = 1$.
For $\overline{m} = 3$, including these coefficients reduces the error marginally
to 10.7%, at a significant computational and memory cost. We thus restrict
ourselves to $\overline{m} = 2$.

Musical genre recognition is a task which is partly invariant to frequency 
transposition. Incorporating a scattering along the log-frequency variable, for 
frequency transposition invariance, reduces the error by about 20%. These 
errors are obtained with a first-order scattering along log-frequency. Adding 
second-order coefficients improves results marginally. 

Providing adaptivity for the wavelet octave bandwidth $Q_1$ by computing
scattering coefficients for both $Q_1 = 1$ and $Q_1 = 8$ further reduces the error
by about 10%. Indeed, music signals include both sharp transients and
narrow-bandwidth frequency components. Further enriching the representation by
concatenating scattering coefficients for $T$ = 370ms, 740ms and 1.5s also reduces
the error rate, which is to be expected since musical signals contain structures
at both short and long scales. This yields an error rate of 8.1%, which compares
favorably to the non-scattering state-of-the-art of 9.4% error [26].
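
The relative error reductions quoted in this section can be recomputed from the
GTZAN column of Table 2. The snippet below is only a convenience for the reader;
the dictionary keys are shorthand labels introduced here, and the error rates are
copied from the table.

```python
def relative_reduction(before, after):
    """Relative error reduction, in percent of the initial error."""
    return 100.0 * (before - after) / before

# GTZAN error rates (in percent) copied from Table 2.
gtzan = {
    "time, m=1": 17.9,
    "time, m=2": 12.3,
    "time+freq, m=2": 10.3,
    "adapt Q1": 9.0,
    "adapt Q1 and T": 8.1,
}
steps = list(gtzan.items())
for (name_a, err_a), (name_b, err_b) in zip(steps, steps[1:]):
    print(f"{name_a} -> {name_b}: {relative_reduction(err_a, err_b):.1f}%")
```

The successive steps correspond to relative reductions of roughly 31%, 16%, 13%
and 10%, in line with the rounded figures given in the text.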

Replacing the SVM with more sophisticated classifiers can improve results. The
error rate from second-order time-scattering coefficients is reduced from 12.3% to
8.8% in [10], with a sparse representation classifier.



8.3 Phone Recognition 

The same scattering representation is tested for phone recognition with the TIMIT
corpus [18]. The dataset contains 6300 phrases, each annotated with the
identities, locations, and durations of its constituent phones. Given the location
and duration of a phone, the goal is to determine its class according to the
standard protocol [13, 28]. The 61 phone classes (excluding the glottal stop /q/)
are collapsed into 48 classes, which are used to train and test models. To
calculate the error rate, these classes are then mapped into 39 clusters. Training
is achieved on the full 3696-phrase training set, excluding "SA" sentences. The
Gaussian kernel SVM parameters are optimized by validation on the standard
400-phrase development set [20]. The error is then calculated on the core
192-phrase test set.

An audio segment of length 192ms centered on a phone can be represented as an
array of MFCC feature vectors with half-overlapping time windows of duration $T$.
This array, with the logarithm of the phone duration added, is fed to the SVM.
Table 2 shows that $T = 23$ms yields a 19.3% error which is much less than the
66.1% error for $T = 370$ms, since many phones have a short duration with highly
transient structures.

A lower error of 17.1% is obtained by replacing the SVM with a sparse
representation classifier on MFCC-like spectral features [38]. Combining MFCCs of
different window sizes and using a committee-based hierarchical discriminative
classifier, [9] achieves an error of 16.7%, the best so far. Finally,
convolutional deep-belief networks cascade convolutions, similarly to scattering,
on a spectrogram using filters learned from the training data. These, combined
with MFCCs, yield an error of 19.7%.

Rows 4 through 6 of Table 2 give the classification results obtained by replacing
MFCC vectors with a time scattering transform computed with first-order Gabor
wavelets with $Q_1 = 8$. Second- and third-order scattering coefficients are
calculated with Morlet wavelets with $Q_2 = Q_3 = 1$. The best results are obtained
with $T = 32$ms. For a maximum order $\overline{m} = 1$, we only keep first-order
scattering coefficients and get an 18.5% error, similar to that of MFCCs. The
error is reduced by about 5% with $\overline{m} = 2$, a smaller improvement than
for GTZAN because scattering invariants are computed on a smaller time interval
$T = 32$ms as opposed to 370ms for music. Second-order coefficients carry less
energy when $T$ is smaller, as shown in Table 1. For the same reason, third-order
coefficients provide even less information compared to the GTZAN case, and do not
improve results.

For a maximum order $\overline{m} = 2$, cascading a log-frequency transposition
invariance computed with a first-order frequency scattering transform of Section 7
reduces the error by about 5%. Computing a second-order frequency scattering
transform only marginally improves results. Adapting the wavelet frequency
resolution by computing scattering coefficients with $Q_1 = 1$ and $Q_1 = 8$ also
reduces the error by a small amount. Finally, adapting the interval $T$ further
improves results because different phones often have very different durations.
This is done by aggregating scattering coefficients computed for $T$ = 32ms, 64ms
and 128ms.

9 Conclusion 

The success of MFCCs for audio classification can partially be explained by their
stability to time-warping deformation. Scattering representations extend MFCCs by
recovering lost high frequencies through successive wavelet convolutions. They
provide modulation spectrum measurements which are stable to time-warping
deformation, and they carry the whole signal energy. The logarithm of second-order
scattering coefficients characterizes amplitude modulations, including transient
phenomena such as attacks. Over $T \approx 200$ms, good audio signal quality is
recovered from first- and second-order scattering coefficients. A frequency
transposition invariant representation is obtained by cascading a second
scattering transform along frequencies. Time and frequency scattering feature
vectors yield state-of-the-art classification results with a Gaussian kernel SVM,
for musical genre classification on GTZAN, and phone identification on TIMIT.



A Modulation Spectrum Properties 

This appendix gives approximations of first- and second-order scattering
coefficients produced by $x(t) = a(t) \, (e \star h)(t)$, for a Gaussian white
noise excitation $e(t)$.



We saw in (37) that 

|s*VAi(«)l«|A(Ai)||e*VAi(*)|o(«) • (46) 

Let us decompose 

\e*^ Xl (t)\=E(\e*iP Xl \) + e(t) , (47) 

where e(t) is a zero-mean stationary process. Since e(t) is a normalized Gaussian 
white noise, e*ip\i(t) is a Gaussian random variable of variance IIV'aJI 2 - It 
results that | e -*- ip\ 1 (t) | and e(t) have a Rayleigh distribution, and since ip is a 
complex quadrature phase wavelet, one can verify that 
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Inserting ( 47 1 and this equation in ( 46 1 shows that 



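$$|x \star \psi_{\lambda_1}(t)| \approx |\hat h(\lambda_1)|\,\big(\pi^{1/2}\,2^{-1}\,\|\psi_{\lambda_1}\|\,a(t) + a(t)\,\epsilon(t)\big). \tag{48}$$

When averaging with $\phi$, we get

$$S_1 x(t,\lambda_1) \approx |\hat h(\lambda_1)|\,\big(\pi^{1/2}\,2^{-1}\,\|\psi_{\lambda_1}\|\,a \star \phi(t) + (a\epsilon) \star \phi(t)\big). \tag{49}$$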
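
The constant $\pi^{1/2}\,2^{-1}$ in (48) and (49) is the mean of a Rayleigh envelope, and the residual $\epsilon(t)$ then has variance $(1 - \pi/4)\,\|\psi_{\lambda_1}\|^2$. The Python sketch below checks both values by simulation. It is only an illustration: the Gabor filter `psi`, its centre frequency `xi`, its width `s` and the number of noise samples are assumptions chosen for the example, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Gabor wavelet standing in for psi_{lambda_1}. The centre
# frequency xi and the width s are illustrative choices, not paper values.
xi, s = np.pi / 2, 16.0
n = np.arange(-64, 65)
psi = np.exp(1j * xi * n) * np.exp(-0.5 * (n / s) ** 2)
psi_norm = np.sqrt(np.sum(np.abs(psi) ** 2))

# Normalized Gaussian white noise e(t) and the envelope |e * psi(t)|.
e = rng.standard_normal(2 ** 18)
envelope = np.abs(np.convolve(e, psi, mode="valid"))

# Mean and variance of the envelope, normalized by the wavelet norm.
mean_ratio = envelope.mean() / psi_norm        # compare with sqrt(pi) / 2
var_ratio = envelope.var() / psi_norm ** 2     # compare with 1 - pi / 4

print(mean_ratio, np.sqrt(np.pi) / 2)
print(var_ratio, 1 - np.pi / 4)
```

Up to Monte Carlo error, the two printed pairs should agree, with reference values $\pi^{1/2}/2 \approx 0.886$ and $1 - \pi/4 \approx 0.215$.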

We are going to show that if $T^{-1} \ll \lambda_1 Q_1^{-1}$ and $a(t)$ is not sparse, which means that
the square of the average of $a(t)$ on intervals of size $T$ is of the order of the
average of $a^2(t)$, then

$$\frac{E(|(a\epsilon) \star \phi(t)|^2)}{\|\psi_{\lambda_1}\|^2\,|a \star \phi(t)|^2} \ll 1, \tag{50}$$

which implies (41). We give the main arguments to compute the orders of
magnitude of the stochastic terms, but it is not a rigorous proof. Computations
rely on the following lemma.

Lemma 1. Let $z(t)$ be a zero-mean stationary process of power spectrum $\widehat{R}_z(\omega)$.
For any deterministic functions $a(t)$ and $h(t)$,

$$E(|(za) \star h(t)|^2) \le \sup_\omega \widehat{R}_z(\omega)\;|a|^2 \star |h|^2(t). \tag{51}$$

Proof. Let $R_z(\tau) = E(z(t)\,z(t+\tau))$. Then

$$E(|(za) \star h(t)|^2) = \iint R_z(v-u)\,a(u)\,h(t-u)\,a(v)^*\,h(t-v)^*\,du\,dv$$

and hence

$$E(|(za) \star h(t)|^2) = \langle R_z y_t, y_t \rangle \quad \text{with} \quad y_t(u) = a(u)\,h(t-u).$$

Since $R_z$ is the kernel of a positive symmetric operator whose spectrum is
bounded by $\sup_\omega \widehat{R}_z(\omega)$, it results that

$$E(|(za) \star h(t)|^2) \le \sup_\omega \widehat{R}_z(\omega)\,\|y_t\|^2 = \sup_\omega \widehat{R}_z(\omega)\;|a|^2 \star |h|^2(t).$$

□
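
The bound of Lemma 1 can be checked on a discrete example. The sketch below is an illustration under stated assumptions, not part of the proof: the process $z$ is taken to be unit white noise filtered by a short FIR filter `g` (a hypothetical choice), so that its autocorrelation and power spectrum are known in closed form, and both sides of (51) are evaluated exactly as quadratic forms. The function name `lemma1_sides`, the amplitude `a` and the window `h` are likewise made up for the example.

```python
import numpy as np

def lemma1_sides(g, a, h, t):
    """Both sides of (51) at time t, for z = unit white noise filtered by g."""
    g, a, h = (np.asarray(v, dtype=float) for v in (g, a, h))
    n, L = len(a), len(g)
    # Autocorrelation R_z(k) = sum_m g[m] g[m + k], for lags -(L-1)..(L-1).
    r = np.correlate(g, g, mode="full")
    # Covariance operator R_z(v - u) written as a Toeplitz matrix.
    lags = np.arange(n)[None, :] - np.arange(n)[:, None]
    R = np.where(np.abs(lags) < L, r[np.clip(lags + L - 1, 0, 2 * L - 2)], 0.0)
    # y_t(u) = a(u) h(t - u), with h taken as zero outside its support.
    idx = t - np.arange(n)
    inside = (idx >= 0) & (idx < len(h))
    y = np.where(inside, a * h[np.clip(idx, 0, len(h) - 1)], 0.0)
    lhs = y @ R @ y                                  # <R_z y_t, y_t>
    # Power spectrum |g_hat(omega)|^2, with its sup estimated on an FFT grid.
    sup_spectrum = np.max(np.abs(np.fft.fft(g, 4096)) ** 2)
    rhs = sup_spectrum * np.sum(y ** 2)              # sup * (|a|^2 * |h|^2)(t)
    return lhs, rhs

rng = np.random.default_rng(0)
a = 1.0 + rng.random(256)      # arbitrary positive amplitude
h = np.hanning(64)             # arbitrary averaging window
lhs, rhs = lemma1_sides([1.0, 0.5], a, h, t=128)
print(lhs <= rhs)
```

For white noise itself (`g = [1.0]`) the covariance operator is the identity, so the two sides coincide and the inequality is tight.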

Since $e(t)$ is a normalized white noise, $e \star \psi_{\lambda_1}$ is a Gaussian process and $\epsilon(t)$
is a stationary Rayleigh process. With a Gaussian chaos expansion, one can
verify [6] that $\sup_\omega |\widehat{R}_\epsilon(\omega)| \le 1 - \pi/4$. Applying Lemma 1 to $z = \epsilon$ and $h = \phi$
gives

$$E(|(\epsilon a) \star \phi(t)|^2) \le (1 - \pi/4)\;|a|^2 \star |\phi|^2(t).$$

Since $\phi$ has a duration $T$, it can be written as $\phi(t) = T^{-1}\phi_0(T^{-1}t)$ for some $\phi_0$
of duration 1. As a result, if the square of the average of $a(t)$ is of the order of
the average of $a^2(t)$ then

$$\frac{|a|^2 \star |\phi|^2(t)}{|a \star \phi(t)|^2} \sim \frac{1}{T}. \tag{52}$$



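The frequency support of $\psi_{\lambda_1}$ is proportional to $\lambda_1 Q_1^{-1}$, so we have
$\|\psi_{\lambda_1}\|^2 \sim \lambda_1 Q_1^{-1}$. Together with (52), if $T^{-1} \ll \lambda_1 Q_1^{-1}$ it proves (50) and hence

$$\frac{S_1 x(t,\lambda_1)}{\|\psi\|\,\lambda_1^{1/2}\,|\hat h(\lambda_1)|\,a \star \phi(t)} \approx \frac{\pi^{1/2}}{2}. \tag{53}$$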
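
The $1/T$ scaling in (52), on which (53) relies, can be illustrated numerically. The following Python sketch assumes a Gaussian averaging window of width `T` samples and evaluates the ratio at the centre sample; the function name `energy_ratio` and the two test amplitudes are hypothetical choices made for the example.

```python
import numpy as np

def energy_ratio(a, T):
    """Left-hand side of (52) at the centre sample, for a Gaussian window of width T."""
    a = np.asarray(a, dtype=float)
    t = np.arange(len(a)) - len(a) // 2
    phi = np.exp(-0.5 * (t / T) ** 2)
    phi /= phi.sum()                     # unit mass: phi(t) = T^{-1} phi_0(t / T)
    return np.sum(a ** 2 * phi ** 2) / np.sum(a * phi) ** 2

n = 4096
flat = np.ones(n)                        # non-sparse amplitude
spike = np.zeros(n)                      # sparse amplitude: a single impulse
spike[n // 2] = 1.0

for T in (16, 32, 64):
    print(T, T * energy_ratio(flat, T), energy_ratio(spike, T))
```

For the flat amplitude, $T$ times the ratio stays constant (about 0.28 for this Gaussian window), which is the $1/T$ behaviour of (52). For the single impulse the ratio equals 1 whatever $T$, which shows why the hypothesis that $a(t)$ is not sparse is needed.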



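Let us now compute $S_2 x(t,\lambda_1,\lambda_2) = ||x \star \psi_{\lambda_1}| \star \psi_{\lambda_2}| \star \phi(t)$. If $T^{-1} \ll \lambda_1 Q_1^{-1}$
then (53) together with (48) shows that

$$\frac{S_2 x(t,\lambda_1,\lambda_2)}{S_1 x(t,\lambda_1)} \approx \frac{|a \star \psi_{\lambda_2}| \star \phi(t)}{a \star \phi(t)} + \tilde\epsilon(t), \tag{54}$$

where

$$0 \le \tilde\epsilon(t) \le \frac{2}{\pi^{1/2}\,\|\psi_{\lambda_1}\|}\;\frac{|(a\epsilon) \star \psi_{\lambda_2}| \star \phi(t)}{a \star \phi(t)}. \tag{55}$$
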
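Observe that

$$E(|(a\epsilon) \star \psi_{\lambda_2}| \star \phi(t)) = E(|(a\epsilon) \star \psi_{\lambda_2}(t)|) \le E(|(a\epsilon) \star \psi_{\lambda_2}(t)|^2)^{1/2}.$$

Lemma 1 applied to $z = \epsilon$ and $h = \psi_{\lambda_2}$ gives the following upper bound:

$$E(|(a\epsilon) \star \psi_{\lambda_2}(t)|^2) \le (1 - \pi/4)\;|a|^2 \star |\psi_{\lambda_2}|^2(t). \tag{56}$$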

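One can write $|\psi_{\lambda_2}(t)| = \lambda_2 Q_2^{-1}\,\theta(\lambda_2 Q_2^{-1} t)$ where $\theta(t)$ satisfies $\int \theta(t)\,dt \sim 1$.
Similarly to (52), we thus verify that if the square of the average of $a(t)$ is of
the order of the average of $a^2(t)$ then

$$\frac{|a|^2 \star |\psi_{\lambda_2}|^2(t)}{|a \star \phi(t)|^2} \sim \frac{\lambda_2}{Q_2}. \tag{57}$$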



Since $\|\psi_{\lambda_1}\|^2 \sim \lambda_1 Q_1^{-1}$, it results from (55), (56) and (57) that
$0 \le E(\tilde\epsilon(t)) \le C\,(4/\pi - 1)^{1/2}\,(\lambda_2 Q_1)^{1/2}\,(\lambda_1 Q_2)^{-1/2}$ with $C \sim 1$.
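
To get a feel for the size of this bound, the short Python sketch below evaluates it and checks where the constant $(4/\pi - 1)^{1/2}$ comes from. The function name `epsilon_bound` and the numerical values of $\lambda_1$, $\lambda_2$, $Q_1$ and $Q_2$ are illustrative assumptions, not settings quoted from this appendix.

```python
import math

def epsilon_bound(lambda1, lambda2, Q1, Q2, C=1.0):
    """Upper bound C (4/pi - 1)^{1/2} (lambda2 Q1)^{1/2} (lambda1 Q2)^{-1/2}."""
    return C * math.sqrt(4 / math.pi - 1) * math.sqrt(lambda2 * Q1 / (lambda1 * Q2))

# The constant combines the factor 2 / sqrt(pi) of (55) with the factor
# sqrt(1 - pi / 4) that results from taking the square root of (56).
constant = (2 / math.sqrt(math.pi)) * math.sqrt(1 - math.pi / 4)

# Illustrative values only: lambda2 one hundred times smaller than lambda1.
bound = epsilon_bound(lambda1=1000.0, lambda2=10.0, Q1=8, Q2=1)
print(constant, bound)
```

With these illustrative values the bound is about 0.15, so the error term stays small as long as $\lambda_2 Q_1$ is well below $\lambda_1 Q_2$.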